Preface





Welcome to the UltraSPARC T1 Processor Supplement, D2.0. This document contains
information about the processor-specific aspects of the architecture and
programming of the UltraSPARC T1 processor, one of Sun Microsystems' family
of processors compliant with the UltraSPARC Architecture. It is intended to supplement
the UltraSPARC Architecture 2005 with processor-specific information.





Target Audience 


This User's Guide is intended mainly for programmers who write software for the
UltraSPARC T1 processor. It is a repository of information useful to
operating system programmers, application software programmers, and
logic designers who are trying to understand the architecture and operation of the
UltraSPARC T1 processor. This manual is both a guide and a reference manual for
programming the processor.





Fonts and Notational Conventions 


Fonts are used as follows: 


■ Italic font is used for emphasis, book titles, and the first instance of a word that is
defined.


■ Italic font is also used for terms where substitution is expected, for example,
"fccn", "virtual processor n", or "reg_plus_imm".


■ Italic sans serif font is used for exception and trap names. For example, "The
privileged_action exception..."


■ lowercase helvetica font is used for register field names (named bits) and
instruction field names, for example: "The rs1 field contains..."


■ UPPERCASE HELVETICA font is used for register names; for example, FSR.


■ TYPEWRITER (Courier) font is used for literal values, such as code (assembly
language, C language, ASI names) and for state names. For example: %f0,
ASI_PRIMARY, execute_state.

















■ When a register field is shown along with its containing register name, they are
separated by a period ('.'), for example, FSR.cexc.


■ UPPERCASE words are acronyms or instruction names. Some common acronyms
appear in the glossary. Note: Names of some instructions contain both upper- and
lower-case letters.


■ An underscore character joins words in register, register field, exception, and trap
names. Note: Such words may be split across lines at the underbar without an
intervening hyphen. For example: "This is true whenever the integer_condition_
code field..."


The following notational conventions are used: 


■ The left arrow symbol ( ← ) is the assignment operator. For example, "PC ← PC +
1" means that the Program Counter (PC) is incremented by 1.


■ Square brackets ( [ ] ) are used in two different ways, distinguishable by the
context in which they are used:


  ■ Square brackets indicate indexing into an array. For example, TT[TL] means the
  element of the Trap Type (TT) array, as indexed by the contents of the Trap
  Level (TL) register.


  ■ Square brackets are also used to indicate optional additions/extensions to
  symbol names. For example, "ST[D,Q]F" expands to all three of "STF",
  "STDF", and "STQF". Similarly, ASI_PRIMARY[_LITTLE] indicates two
  related address space identifiers, ASI_PRIMARY and ASI_PRIMARY_LITTLE.
  (Contrast with the use of angle brackets, below.)








■ Angle brackets ( < > ) indicate mandatory additions/extensions to symbol names.
For example, "ST<D | Q>F" expands to mean "STDF" and "STQF". (Contrast with
the second use of square brackets, above.)


■ Curly braces ( { } ) indicate a bit field within a register or instruction. For example,
CCR{4} refers to bit 4 in the Condition Code Register.


■ A consecutive set of values is indicated by specifying the upper and lower limit of
the set separated by a colon ( : ), for example, CCR{3:0} refers to the set of four
least significant bits of register CCR. (Contrast with the use of double periods,
below.)


■ A double period ( .. ) indicates any single intermediate value between two given
end values is possible. For example, NAME[2..0] indicates four forms of NAME
exist: NAME, NAME2, NAME1, and NAME0; whereas NAME<2..0> indicates
that three forms exist: NAME2, NAME1, and NAME0. (Contrast with the use of
the colon, above.)


■ A vertical bar ( | ) separates mutually exclusive alternatives inside square
brackets ( [ ] ), angle brackets ( < > ), or curly braces ( { } ). For example,
"NAME[A | B]" expands to "NAME, NAMEA, NAMEB" and "NAME<A | B>"
expands to "NAMEA, NAMEB".


■ The asterisk ( * ) is used as a wild card, encompassing the full set of valid values.
For example, FCMP* refers to FCMP with all valid suffixes (in this case,
FCMP<s | d | q> and FCMPE<s | d | q>). An asterisk is typically used when the full
list of valid values either is not worth listing (because it has little or no relevance
in the given context) or the valid values are too numerous to list in the available
space.


■ The slash ( / ) is used to separate paired or complementary values in a list, for
example, "the LDBLOCKF/STBLOCKF instruction pair ...."


■ The double colon (::) is an operator that indicates concatenation (typically, of bit
vectors). Concatenation strictly strings the specified component values into a
single longer string, in the order specified. The concatenation operator performs
no arithmetic operation on any of the component values.
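
The bracket conventions above can be checked mechanically. The following Python sketch is illustrative only: the function names `expand` and `_alternatives` are hypothetical and not part of any Sun tool. It assumes that alternatives inside a bracket pair are separated by a vertical bar or a comma, or are given as a double-period range. It treats every square bracket as the "optional extension" form, so it must not be applied to indexing expressions such as TT[TL].

```python
import re

def _alternatives(body):
    """Split the text inside one bracket pair into its alternatives."""
    body = body.strip()
    m = re.fullmatch(r"(\d+)\s*\.\.\s*(\d+)", body)
    if m:  # double-period range such as 2..0
        first, last = int(m.group(1)), int(m.group(2))
        step = -1 if first >= last else 1
        return [str(v) for v in range(first, last + step, step)]
    return [part.strip() for part in re.split(r"[|,]", body)]

def expand(pattern):
    """Expand every [optional] and <mandatory> group in a symbol name."""
    m = re.search(r"\[([^\]]*)\]|<([^>]*)>", pattern)
    if m is None:
        return [pattern]
    optional = m.group(1) is not None
    body = m.group(1) if optional else m.group(2)
    # Square brackets also allow the bare name (no extension at all).
    choices = ([""] if optional else []) + _alternatives(body)
    head, tail = pattern[:m.start()], pattern[m.end():]
    names = []
    for choice in choices:
        names.extend(expand(head + choice + tail))
    return names

print(expand("ST[D,Q]F"))     # optional extension
print(expand("ST<D | Q>F"))   # mandatory extension
print(expand("NAME[2..0]"))   # optional, double-period range
```

The only difference between the two bracket forms in this sketch is whether the empty extension is included in the list of choices.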





Notation for Numbers 


Numbers throughout this specification are decimal (base-10) unless otherwise
indicated. Numbers in other bases are followed by a numeric subscript indicating
their base (for example, 1001₂, FFFF 0000₁₆). In some cases, numbers may be
preceded by "0x" to indicate hexadecimal (base-16) notation (for example,
0xFFFF 0000). Long binary and hexadecimal numbers within the text may have
spaces inserted every four characters to improve readability.


An en dash ( – ) with no spaces indicates a range, for example, 0001₁₆–0000₁₆.


Also see the colon ( : ) and double period ( .. ) notation described in the previous 
section. 





Informational Notes 


This manual provides several different types of information in notes, as follows: 


Note               | General notes contain incidental information relevant to the
                   | paragraph preceding the note.


Programming        | Programming notes contain incidental information about how
Note               | software can use an architectural feature.


Implementation     | An Implementation Note contains incidental information
Note               | describing how an UltraSPARC Architecture processor might
                   | implement an architectural feature.


V9 Compatibility   | A V9 Compatibility Note contains information about possible
Note               | differences between UltraSPARC Architecture and SPARC V9
                   | implementations. Such information may not pertain to other
                   | SPARC V9 implementations.
CHAPTER 1


UltraSPARC T1 Basics 








1.1 Background


UltraSPARC T1 is the first chip multiprocessor that fully implements Sun’s 
Throughput Computing initiative. Throughput Computing is a technique that takes 
advantage of the thread-level parallelism that is present in most commercial 
workloads. Unlike desktop workloads, which often have a small number of threads 
concurrently running, most commercial workloads achieve their scalability by 
employing large pools of concurrent threads. 


Historically, microprocessors have been designed to target desktop workloads, and 
as a result have focused on running a single thread as quickly as possible. Single 
thread performance is achieved in these microprocessors by a combination of 
extremely deep pipelines (over 20 stages in Pentium 4) and by executing multiple 
instructions in parallel (referred to as instruction-level parallelism, or ILP). The basic 
tenet behind Throughput Computing is that exploiting ILP and deep pipelining has 
reached the point of diminishing returns and as a result, current microprocessors do 
not utilize their underlying hardware very efficiently. 


For many commercial workloads, the physical processor core will be idle most of the 
time waiting on memory, and even when it is executing it will often be able to only 
utilize a small fraction of its wide execution width. So rather than building a large 
and complex ILP processor that sits idle most of the time, a number of small, single- 
issue physical processor cores that employ multithreading are built in the same chip 
area. Combining multiple physical processor cores on a single chip with multiple
hardware-supported threads (strands) per physical processor core allows very high
performance for highly threaded commercial applications. This approach is called
thread-level parallelism (TLP). The difference between TLP and ILP is shown in
FIGURE 1-1.


[Figure 1-1 image: execution timelines for four strands under TLP (Strand 1
through Strand 4) and for a single strand under ILP executing two instructions
per cycle; shading distinguishes "Executing" from "Stalled on Memory".]


FIGURE 1-1 Differences Between TLP and ILP


The memory stall time of one strand can often be overlapped with execution of other
strands on the same physical processor core, and multiple physical processor cores
run their strands in parallel. In the ideal case, shown in FIGURE 1-1, memory latency
can be completely overlapped with execution of other strands. In contrast,
instruction-level parallelism simply shortens the time to execute instructions, and
does not help much in overlapping execution with memory latency.¹
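
As a rough illustration of this overlap, the sketch below computes how busy a single-issue pipeline would be under an idealized model. The model is an assumption made for this example, not a description of the hardware: every strand alternates between a fixed number of execution cycles and a fixed number of memory-stall cycles, and switching between strands costs nothing. The function name and the cycle counts are hypothetical.

```python
def pipeline_utilization(strands, exec_cycles, stall_cycles):
    """Fraction of cycles in which a single-issue pipeline does useful work.

    Idealized assumption: each strand repeatedly executes for exec_cycles,
    then waits on memory for stall_cycles, and strand switching is free.
    """
    demand = strands * exec_cycles / (exec_cycles + stall_cycles)
    return min(1.0, demand)  # the pipeline cannot be more than 100% busy

# Illustrative numbers only: 25 cycles of work per 75 cycles of memory stall.
for n in (1, 2, 4):
    print(n, "strand(s):", pipeline_utilization(n, 25, 75))
```

Under these assumed numbers one strand keeps the pipeline busy a quarter of the time, and four strands are enough to hide the stall time completely, which is the ideal case described above.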


Given this ability to overlap execution with memory latency, why don't more
processors utilize TLP? The answer is that designing processors is a mostly
evolutionary process, and the ubiquitous deeply pipelined, wide ILP physical
processor cores of today are the evolutionary outgrowth from a time when the CPU
was the bottleneck in delivering good performance.


With physical processor cores capable of multiple-GHz clocking, the performance 
bottleneck has shifted to the memory and I/O subsystems and TLP has an obvious 
advantage over ILP for tolerating the large I/O and memory latency prevalent in 
commercial applications. Of course, every architectural technique has its advantages
and disadvantages. The one disadvantage of employing TLP over ILP is that
execution of a single strand may be slower on a TLP processor than on an ILP
processor. With physical processor cores running at frequencies well over one GHz,
a strand capable of executing only a single instruction per cycle is fully capable of
completing tasks in the time required by the application, making this disadvantage a
non-issue for nearly all commercial applications.


1. Processors that employ out-of-order ILP can overlap some memory latency with execution. However, this
overlap is typically limited to shorter memory latency events such as L1 cache misses that hit in the L2 cache. 
Longer memory latency events such as main memory accesses are rarely overlapped to a significant degree 
with execution by an out-of-order processor. 
1.2 UltraSPARC T1 Overview


UltraSPARC T1 is a single-chip multiprocessor. UltraSPARC T1 contains eight 
SPARC® physical processor cores. Each SPARC physical processor core has full 
hardware support for four virtual processors (or “strands”). These four strands run 
simultaneously, with the instructions from each of the four strands executed round- 
robin by the single-issue pipeline. When a strand encounters a long-latency event, 
such as a cache miss, it is marked unavailable and instructions will not be issued 
from that strand until the long-latency event is resolved. Round-robin execution of 
the remaining available strands will continue while the long-latency event of the 
first strand is resolved. 
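
The round-robin behavior described above can be sketched as follows. This is a simplified software model, not the hardware's actual selection logic: it assumes one instruction issues per cycle and that availability is known at the start of each cycle. The function names `next_strand` and `issue_order` are hypothetical.

```python
def next_strand(last_issued, available, num_strands=4):
    """Return the next available strand after last_issued, or None."""
    for offset in range(1, num_strands + 1):
        candidate = (last_issued + offset) % num_strands
        if available[candidate]:
            return candidate
    return None  # every strand is waiting on a long-latency event

def issue_order(available_per_cycle, num_strands=4):
    """Strand chosen in each cycle, given per-cycle availability flags."""
    order = []
    last = num_strands - 1  # so that strand 0 is considered first
    for available in available_per_cycle:
        strand = next_strand(last, available, num_strands)
        order.append(strand)
        if strand is not None:
            last = strand
    return order

# Strand 1 is marked unavailable (for example, a cache miss) for five cycles.
print(issue_order([[True, False, True, True]] * 5))
```

In this model the unavailable strand is simply skipped, and the remaining three strands continue to share the pipeline in rotation.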


Each SPARC physical core has a 16-Kbyte, 4-way associative instruction cache (32-byte
lines) and an 8-Kbyte, 4-way associative data cache (16-byte lines) that are shared
by the four strands. The eight SPARC physical cores are connected through a
crossbar to an on-chip unified 3-Mbyte, 12-way associative L2 cache (with 64-byte
lines). The L2 cache is banked 4 ways to provide sufficient bandwidth for the eight
SPARC physical cores.
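
The cache sizes quoted above imply the following set counts. The sketch uses the standard relationship sets = size / (ways × line size); the helper name `num_sets` is hypothetical, and the per-bank figure assumes the L2 sets are divided evenly among the four banks, which this document does not state explicitly.

```python
KBYTE = 1024
MBYTE = 1024 * KBYTE

def num_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)

icache_sets = num_sets(16 * KBYTE, 4, 32)   # instruction cache
dcache_sets = num_sets(8 * KBYTE, 4, 16)    # data cache
l2_sets = num_sets(3 * MBYTE, 12, 64)       # unified L2 cache
l2_sets_per_bank = l2_sets // 4             # assumes an even split over 4 banks

print(icache_sets, dcache_sets, l2_sets, l2_sets_per_bank)
```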





1.3 UltraSPARC T1 Components


This section describes each component in UltraSPARC T1:


■ SPARC physical core
■ Floating-point unit
■ L2 cache


1.3.1 SPARC Physical Core


Each SPARC physical core has hardware support for four strands. This support 
consists of a full register file (with eight register windows) per strand, with most of 
the ASI, ASR, and privileged registers replicated per strand. The four strands share 
the instruction and data caches. 


FIGURE 1-2 illustrates the SPARC physical core.









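[Figure 1-2 image: block diagram of the SPARC physical core; legible block
labels include Strand Instruction Registers, Strand Scheduler, D-Cache,
Register Files, and External Interface.]


FIGURE 1-2 SPARC Core Block Diagram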


1.3.2 Floating-Point Unit (FPU) 


A single floating-point unit is shared by all eight SPARC physical cores. The shared 
floating-point unit is sufficient for most commercial applications, in which fewer 
than 1% of instructions typically involve floating-point operations. 


1.3.3 L2 Cache 


The L2 cache is banked four ways, with the bank selection based on address bits 7:6. 
The cache is 3 Mbytes, 12-way set associative, and has a line size of 64 bytes. 
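
Bank selection from address bits 7:6 can be sketched as below. The function names are hypothetical. The byte offset within a line uses bits 5:0, which follows from the 64-byte line size.

```python
def l2_bank(physical_address):
    """L2 bank (0-3) selected by address bits 7:6."""
    return (physical_address >> 6) & 0x3

def l2_line_offset(physical_address):
    """Byte offset within a 64-byte L2 line (address bits 5:0)."""
    return physical_address & 0x3F

# Consecutive 64-byte lines map to consecutive banks, wrapping after four.
for address in (0x000, 0x040, 0x080, 0x0C0, 0x100):
    print(hex(address), "-> bank", l2_bank(address))
```

One consequence of using bits 7:6 is that a sequential sweep through memory spreads its accesses evenly across all four banks.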
CHAPTER 2


Data Formats 





The UltraSPARC T1 processor supports all UltraSPARC Architecture 2005 data 
formats; see the Data Formats chapter of the UltraSPARC Architecture 2005 for 
details. 


6 UltraSPARC T1 Supplement * Draft D2.0, 17 Mar 2006 


CHAPTER 3 


Registers 





This chapter discusses the specifics of UltraSPARC T1 registers, as they differ from 
the register definitions in UltraSPARC Architecture 2005. 





3.1 Ancillary State Registers (ASRs)


3.1.1 TICK Register


See the UltraSPARC Architecture 2005 for a general description of this register. 


The TICK register contains two fields: npt and counter. On an UltraSPARC T1
processor, the npt field is replicated per strand, while the counter field is shared by
all four strands on a physical processor core. The counter increments on each physical
processor core clock cycle but, on an UltraSPARC T1 processor, the least significant 2 bits
of the counter field always read as 0.
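
The effect of the two always-zero bits on a value read from TICK can be sketched as follows. The field positions used here (npt in bit 63, counter in bits 62:0) are taken from the general TICK layout in the UltraSPARC Architecture 2005 and are an assumption as far as this supplement's text is concerned. The function name is hypothetical.

```python
TICK_NPT_SHIFT = 63                 # assumed position of the npt field
TICK_COUNTER_MASK = (1 << 63) - 1   # assumed counter field, bits 62:0

def tick_read_value(npt, counter):
    """Value software would observe when reading TICK on UltraSPARC T1."""
    visible = counter & TICK_COUNTER_MASK & ~0x3  # low 2 bits read as 0
    return (npt << TICK_NPT_SHIFT) | visible

print(hex(tick_read_value(0, 0x1007)))  # low two counter bits are masked off
```

Software that computes elapsed time from two TICK reads therefore sees a granularity of four clock cycles.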


3.1.2 General Status Register (GSR)


Each strand has a nonprivileged General Status register (GSR), as described in the 
UltraSPARC Architecture 2005. 


All UltraSPARC Architecture 2005 GSR fields are supported in the UltraSPARC T1
implementation. However, the mask and scale fields are not directly written by VIS
instructions; they are provided for use by software emulation.


3.1.3 Software Interrupt Register (SOFTINT)


Each strand has a privileged software interrupt register, as described in the 
UltraSPARC Architecture 2005. 


The software interrupt register contains three fields: sm, int_level, and tm. Setting
any of sm, tm, or SOFTINT{14} generates an interrupt_level_14 exception. However,
these bits are considered completely independent of each other. Thus, a STICK
compare event sets only bit 16 and generates an interrupt_level_14 exception; it
does not also set bit 14.
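
The independence of these bits can be sketched as below. The text above places sm at bit 16; the positions of tm (bit 0) and int_level (bits 15:1) are taken from the UltraSPARC Architecture 2005 layout and are assumptions here. The function name is hypothetical.

```python
SOFTINT_TM = 1 << 0    # assumed position of the tm field
SOFTINT_SM = 1 << 16   # sm field (bit 16, per the text above)

def pending_levels(softint):
    """Interrupt levels requested by a SOFTINT register value."""
    # int_level occupies bits 15:1; bit n requests interrupt level n.
    levels = {n for n in range(1, 16) if softint & (1 << n)}
    # sm and tm each request level 14 without touching bit 14 itself.
    if softint & (SOFTINT_SM | SOFTINT_TM):
        levels.add(14)
    return levels

after_stick_compare = SOFTINT_SM
print(pending_levels(after_stick_compare))      # level 14 is requested
print(bool(after_stick_compare & (1 << 14)))    # bit 14 itself stays clear
```

A handler for level 14 therefore needs to examine sm, tm, and bit 14 separately to find out which source raised the interrupt.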


UltraSPARC T1    | It is possible (but difficult) in UltraSPARC T1 for software to
Programming      | clear a SOFTINT bit between the setting of that bit and the
Note             | generation of the interrupt from the bit being set (there
                 | is a three-cycle window between the setting of the bit and the
                 | interrupt in UltraSPARC T1). If software were to do this, it
                 | would see an interrupt_level_n interrupt, but would find no bit
                 | set in the SOFTINT register. Note that normal software would
                 | only clear a bit in response to taking the interrupt_level_n
                 | exception, so this race condition should not occur in normal
                 | operation.


UltraSPARC T1    | It is possible, but even more difficult than the above case, for software
Programming      | to zero a SOFTINT bit as it is getting set to 1, while another core is
Note             | accessing its SOFTINT register, with timing such that hardware
                 | decides to take a SOFTINT trap, but the SOFTINT register is clear by
                 | the time it decides the trap number. In this case, hardware will take a
                 | trap 40₁₆. Since software should only clear a bit that is known to be
                 | set, this should never happen in normal operation.





3.1.4 Tick Compare Register (TICK_CMPR)


Each strand has a privileged Tick Compare (TICK_CMPR) register, as described in
the UltraSPARC Architecture 2005.


3.1.5 System Tick Register (STICK)


On an UltraSPARC T1 processor, the STICK register is an alias for the TICK register. 
Writes to STICK will be reflected in TICK, and vice versa. See the description of TICK 
above for the behavior of this register. 


3.1.6 System Tick Compare Register (STICK_CMPR)


Each strand has a privileged System Tick Compare (STICK_CMPR) register, as
described in the UltraSPARC Architecture 2005.
3.1.7 PCR and PIC Registers


TABLE 3-1 UltraSPARC T1-Specific Performance Instrumentation Registers

ASR Number | ASR Name | Access | priv  | Replicated by Strand | Description
10₁₆       | PCR      | RW     | Y (2) | Y                    | Performance counter control register
11₁₆       | PIC      | RW     | Y (1) | Y                    | Performance Instrumentation Counter register

Notes:

1. Nonprivileged access with PCR.priv = 1 causes a privileged_action exception.

2. Nonprivileged access causes a privileged_opcode exception.





3.2 PR State Registers


3.2.1 Trap State (TSTATE) 


Each virtual processor (strand) has MAXPTL(2) Trap State (TSTATE) registers, as 
described in the UltraSPARC Architecture 2005. 


3.2.2 Processor State Register (PSTATE)


Each virtual processor (strand) has a Processor State register, as described in the 
UltraSPARC Architecture 2005. 


3.2.3 Trap Level Register (TL)


Each virtual processor (strand) has a Trap Level register, as described in the 
UltraSPARC Architecture 2005. 


The maximum trap level (MAXPTL) for UltraSPARC T1 is 2.


3.2.4 Global Level Register (GL) 


Each virtual processor (strand) has a Global Level register, as described in the 
UltraSPARC Architecture 2005. 


The maximum global level (MAXPGL) for UltraSPARC T1 is 2. 





3.3 Floating-Point State Register (FSR) 


Each virtual processor (strand) has a Floating-Point State register, FSR, as described 
in the UltraSPARC Architecture 2005. 


UltraSPARC T1 does not provide a nonstandard floating-point mode, so the ns field 
of FSR is always 0. 


On UltraSPARC T1, FSR.ver always reads as 0.


FSR.qne always reads as 0, because UltraSPARC T1 neither needs nor supports a 
floating-point queue (FQ). 
CHAPTER 4


Instruction Set Overview 





The UltraSPARC T1 processor implements the instruction set described in the 
UltraSPARC Architecture 2005. Additional UltraSPARC T1-specific details are 
described in this chapter. 





4.1 State Register Access


UltraSPARC T1 supports the standard ASRs described in the UltraSPARC 
Architecture 2005. 





4.2 Floating-Point Operate (FPop) Instructions


UltraSPARC T1 implements the floating-point instruction set described in the 
UltraSPARC Architecture 2005. 


UltraSPARC T1 generates the correct IEEE Std 754-1985 results (impl. dep. #3). 


All floating-point quad-precision operations cause an fp_exception_other trap with 
FSR.ftt = unimplemented_FPop, and system software must emulate those 
operations. 





4.3 Reserved Opcodes and Instruction Fields


An attempt to execute an opcode to which no instruction is assigned causes a trap.

Specifically:

■ Attempting to execute a reserved FPop causes an fp_exception_other trap (with
FSR.ftt = unimplemented_FPop).

■ Attempting to execute any other reserved opcode causes an illegal_instruction
trap.


■ Attempting to execute a Tcc instruction with a nonzero value in the reserved field
(bits 10:8 and 6:5 when i = 0 or bits 10:7 when i = 1) causes an illegal_instruction
trap. See Trap on Integer Condition Codes (Tcc) on page 16.


See Appendix C, Opcode Maps, for a complete enumeration of the opcode 
assignments. 





4.4 Register Window Management 


N_REG_WINDOWS = 8 on UltraSPARC T1 (impl. dep. #2-V8). The state of the eight
register windows is determined by the contents of the set of privileged registers
described in the UltraSPARC Architecture 2005.




CHAPTER 5 


Instruction Definitions 








5.1 Instruction Set Summary


The UltraSPARC T1 CPU implements both the standard UltraSPARC Architecture 
2005 instruction set and a number of implementation-dependent extended 
instructions. Standard UltraSPARC Architecture 2005 instructions are documented in 
the UltraSPARC Architecture 2005. UltraSPARC T1 extended instructions are 
documented in VIS Instructions on page 16. 


The superscripts and their meanings are defined in TABLE 5-1. 


TABLE 5-1 Instruction Superscripts 


Superscript Meaning 
D Deprecated instruction 
P Privileged instruction 





UltraSPARC T1 executes most UltraSPARC Architecture 2005 instructions in 
hardware. Those that trap and are emulated in software are listed in TABLE 5-2. 


TABLE 5-2 UltraSPARC Architecture 2005 Instructions Not Directly Implemented by UltraSPARC T1
Hardware (1 of 3)

Instruction | Description | Exception Caused by Attempted Execution
ALLCLEAN | Mark all windows as clean | illegal_instruction
ARRAY{8,16,32} | 3-D address to blocked byte address conversion | illegal_instruction
BMASK | Write the GSR.mask field | illegal_instruction
BSHUFFLE | Permute bytes as specified by the GSR.mask field | illegal_instruction
EDGE{8,16,32}{L}{N} | Edge boundary processing {little-endian} {non-condition-code altering} | illegal_instruction
FABSq | Floating-point absolute value quad | fp_exception_other [unimplemented_FPop]










TABLE 5-2 UltraSPARC Architecture 2005 Instructions Not Directly Implemented by UltraSPARC T1
Hardware (2 of 3)

Instruction | Description | Exception Caused by Attempted Execution
FADDq | Floating-point add quad | fp_exception_other [unimplemented_FPop]
FCMPq | Floating-point compare quad | fp_exception_other [unimplemented_FPop]
FCMPEq | Floating-point compare quad (exception if unordered) | fp_exception_other [unimplemented_FPop]
FCMPEQ{16,32} | Four 16-bit / two 32-bit compare: set integer dest if src1 = src2 | illegal_instruction
FCMPGT{16,32} | Four 16-bit / two 32-bit compare: set integer dest if src1 > src2 | illegal_instruction
FCMPLE{16,32} | Four 16-bit / two 32-bit compare: set integer dest if src1 ≤ src2 | illegal_instruction
FCMPNE{16,32} | Four 16-bit / two 32-bit compare: set integer dest if src1 ≠ src2 | illegal_instruction
FDIVq | Floating-point divide quad | fp_exception_other [unimplemented_FPop]
FdMULq | Floating-point multiply double to quad | fp_exception_other [unimplemented_FPop]
FEXPAND | Four 8-bit to 16-bit expand | illegal_instruction
FiTOq | Convert integer to quad floating-point | fp_exception_other [unimplemented_FPop]
FMOVq | Floating-point move quad | fp_exception_other [unimplemented_FPop]
FMOVqcc | Move quad floating-point register if condition is satisfied | fp_exception_other [unimplemented_FPop]
FMOVqr | Move quad floating-point register if integer register contents satisfy condition | fp_exception_other [unimplemented_FPop]
FMULq | Floating-point multiply quad | fp_exception_other [unimplemented_FPop]
FMUL8SUx16 | Signed upper 8- x 16-bit partitioned product of corresponding components | illegal_instruction
FMUL8ULx16 | Unsigned lower 8-bit x 16-bit partitioned product of corresponding components | illegal_instruction
FMUL8x16 | 8- x 16-bit partitioned product of corresponding components | illegal_instruction
FMUL8x16AL | Signed lower 8-bit x 16-bit lower α partitioned product of four components | illegal_instruction
FMUL8x16AU | Signed upper 8-bit x 16-bit lower α partitioned product of four components | illegal_instruction
FMULD8SUx16 | Signed upper 8-bit x 16-bit multiply → 32-bit partitioned product of components | illegal_instruction
FMULD8ULx16 | Unsigned lower 8-bit x 16-bit multiply → 32-bit partitioned product of components | illegal_instruction
FNEGq | Floating-point negate quad | fp_exception_other [unimplemented_FPop]








TABLE 5-2 UltraSPARC Architecture 2005 Instructions Not Directly Implemented by UltraSPARC T1
Hardware (3 of 3)

Instruction | Description | Exception Caused by Attempted Execution
FPACKFIX | Two 32-bit to 16-bit fixed pack | illegal_instruction
FPACK{16,32} | Four 16-bit/two 32-bit pixel pack | illegal_instruction
FPMERGE | Two 32-bit to 64-bit fixed merge | illegal_instruction
FSQRT(s,d,q) | Floating-point square root | fp_exception_other [unimplemented_FPop]
F(s,d,q)TO(q) | Convert between floating-point formats to quad | fp_exception_other [unimplemented_FPop]
FqTOi | Convert quad floating point to integer | fp_exception_other [unimplemented_FPop]
FqTOx | Convert quad floating point to 64-bit integer | fp_exception_other [unimplemented_FPop]
FSUBq | Floating-point subtract quad | fp_exception_other [unimplemented_FPop]
FxTOq | Convert 64-bit integer to floating-point | fp_exception_other [unimplemented_FPop]
IMPDEP1 | Implementation-dependent instruction | illegal_instruction
IMPDEP2 | Implementation-dependent instruction | illegal_instruction
INVALWP | Mark all windows as CANSAVE | illegal_instruction
LDQF | Load quad floating-point | illegal_instruction
LDQFA | Load quad floating-point into alternate space | illegal_instruction
LDSHORTF | Short FP load, zero-extend 8/16-bit load to a double-precision floating-point register | data_access_exception
NORMALW | Mark other windows as restorable | illegal_instruction
OTHERW | Mark restorable windows as other | illegal_instruction
PDIST | Distance between eight 8-bit components | illegal_instruction
POPC | Population count | illegal_instruction
PST | Eight 8-bit/four 16-bit/two 32-bit partial stores | data_access_exception
SHUTDOWN (D,P) | Shut down | illegal_instruction
STBLOCKF | 64-byte block store with commit | data_access_exception
STQF | Store quad floating-point | illegal_instruction
STQFA | Store quad floating-point into alternate space | illegal_instruction
STSHORTF | Short FP store, 8-/16-bit store from a double-precision floating-point register | data_access_exception








5.2 Prefetch and Prefetch from Alternate Space


PREFETCH and PREFETCHA with fcn codes of 0–3 and 16–23 (10₁₆–17₁₆) are
implemented; all map to the same operation that brings the cache line into the L2
cache. On an MMU miss, the prefetch is dropped (weak prefetching).


Prefetch fcn codes 5₁₆–F₁₆ cause an illegal_instruction trap. The implemented operations are all
"weak" prefetches; in some cases the prefetch operation is dropped.
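
The fcn decoding described above can be summarized in a small classifier. This is a sketch with hypothetical names and result strings. Codes that the text above does not mention (4 and 24 through 31) are reported as unspecified rather than guessed. The 0 to 31 range check reflects the 5-bit width of the fcn field.

```python
def prefetch_behavior(fcn):
    """Classify a PREFETCH/PREFETCHA fcn code per the text of this section."""
    if not 0 <= fcn <= 31:
        raise ValueError("fcn is a 5-bit field")
    if 0 <= fcn <= 3 or 0x10 <= fcn <= 0x17:
        return "weak prefetch into L2"
    if 0x5 <= fcn <= 0xF:
        return "illegal_instruction"
    return "not specified in this section"

for code in (0, 4, 5, 0x10, 0x18):
    print(hex(code), "->", prefetch_behavior(code))
```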





5.3 Trap on Integer Condition Codes (Tcc)


See the UltraSPARC Architecture 2005 for a complete description of the Tcc
instruction.


UltraSPARC T1    | For the i = 0 variant of Tcc, UltraSPARC T1 does not check that
Implementation   | reserved instruction bit 7 is 0. If bit 7 is set to 1 with i = 0,
Note             | UltraSPARC T1 treats it as a valid Tcc instruction.
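
The reserved-field check that UltraSPARC T1 applies to Tcc can be sketched as a bit-mask test. The function name is hypothetical. The position of the i bit (bit 13) comes from the standard SPARC instruction format rather than from this supplement. The sketch examines only the reserved bits and does not verify that the word is actually a Tcc opcode.

```python
def tcc_reserved_nonzero(instruction):
    """True if the reserved Tcc bits checked by UltraSPARC T1 are nonzero."""
    i = (instruction >> 13) & 1
    if i == 0:
        # Bits 10:8 and 6:5 are checked; bit 7 is not checked on UltraSPARC T1.
        mask = (0x7 << 8) | (0x3 << 5)
    else:
        # Bits 10:7 are checked.
        mask = 0xF << 7
    return (instruction & mask) != 0

print(tcc_reserved_nonzero(1 << 7))               # i = 0, only bit 7 set
print(tcc_reserved_nonzero((1 << 13) | (1 << 7))) # i = 1, bit 7 set
```

A `True` result corresponds to the case that causes an illegal_instruction trap.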





5.4 VIS Instructions


UltraSPARC T1 supports in hardware the VIS 2 SIAM instruction and a subset of the 
VIS 1 instructions. 


All other VIS 1 and VIS 2 instructions (see TABLE 5-2 for a list) cause an
illegal_instruction exception on UltraSPARC T1 and are emulated in software.


UltraSPARC T1    | The use of VIS instructions on UltraSPARC T1 is strongly
Programming      | discouraged; the performance of even the implemented VIS
Note             | instructions will often be below that of a comparable set of
                 | non-VIS instructions. This includes the block load and block store
                 | instructions. An UltraSPARC T1 physical processor core (four
                 | virtual processors) can only have a single outstanding
                 | floating-point operation (including block load, block store, and VIS
                 | instructions) in progress at any given time.
5.5 Partitioned Add/Subtract Instructions


See the UltraSPARC Architecture 2005 for detailed descriptions of the FPADD and 
FPSUB instructions. 


UltraSPARC T1    | For good performance on UltraSPARC T1, the result of a single
Programming      | FPADD should not be used as part of a 64-bit graphics
Note             | instruction source operand in the next instruction group.
                 |
                 | Similarly, the result of a standard FPADD should not be used as
                 | a 32-bit graphics instruction source operand in the next
                 | instruction group.





5.6 Align Data


See the UltraSPARC Architecture 2005 for a detailed description of the FALIGNDATA
instruction.


UltraSPARC T1    | For good performance on UltraSPARC T1, the result of
Programming      | FALIGNDATA should not be used as the source operand of a
Note             | 32-bit SIMD instruction in the next instruction group.





5.7 F Register Logical Operate Instructions


See the UltraSPARC Architecture 2005 for a description of the F register logical 
operate instructions (1-, 2-, and 3-operand). 


UltraSPARC T1    | For good performance on UltraSPARC T1, the result of a single
Programming      | logical operate instruction should not be used as part of the
Note             | source operand of a 64-bit SIMD instruction in the next
                 | instruction group.
                 |
                 | Similarly, the result of a standard logical operate instruction
                 | should not be used as the source operand of a 32-bit SIMD
                 | instruction in the next instruction group.








5.8 Block Load and Store Instructions 


For architectural descriptions of the LDBLOCKF and STBLOCKF instructions, see 
the UltraSPARC Architecture 2005. 


UltraSPARC T1    | On UltraSPARC T1, a block load forces a miss in the primary
Implementation   | cache and will not allocate a line in the primary cache, but does
Note             | allocate in the L2 cache. On UltraSPARC T1, block loads and
                 | stores from multiple virtual processors are not overlapped.


Compatibility    | These instructions were intended for use in transferring large
Note             | blocks of data (more than 256 bytes); for example, in BCOPY
                 | and BFILL operations.
                 |
                 | The use of block loads and stores on UltraSPARC T1 is
                 | deprecated; they are provided primarily for compatibility with
                 | existing software. UltraSPARC T1 provides a separate set of
                 | ASIs for high performance BCOPY and BFILL, as described in
                 | TABLE 9-1 on page 41. The performance of parallel BCOPY using
                 | appropriate ASIs (from among 22₁₆, 23₁₆, E2₁₆, E3₁₆, EA₁₆, and
                 | EB₁₆) will be 2.5 to 3.5 times that of a BCOPY using block loads
                 | and stores. The performance of a single-threaded BCOPY using
                 | these ASIs will be 15% to 50% better than that of a BCOPY using
                 | block loads and stores.





On UltraSPARC T1, to order an LDBLOCKF with respect to earlier stores, an
intervening MEMBAR #Sync must be executed.


Similarly on UltraSPARC T1, STBLOCKF source data registers are not interlocked 
against completion of previous load instructions (even if a second LDBLOCKF has 
been performed). The previous load data must be referenced by some other 
intervening instruction, or an intervening MEMBAR #Sync must be performed. If 
the programmer violates these rules, data from before or after the load may be used. 
UltraSPARC T1 continues execution before all of the store data has been transferred. 
If store data registers are overwritten before the next block store or MEMBAR #Sync 
instruction, then the following rule must be observed. The first register can be 
overwritten in the same instruction group as the STBLOCKF, the second register can 
be overwritten in the instruction group following the block store and so on. If this 
rule is violated, the store may store correct data or the overwritten data. Block stores 
always operate under the relaxed memory order (RMO) memory model, regardless 
of the PSTATE.mm setting, and require a subsequent MEMBAR #Sync to order them 
with respect to following loads. 
After an STBLOCKF instruction but before executing a DONE, RETRY, or WRPR to
PSTATE instruction, there must be an intervening MEMBAR #Sync or a trap. If this
rule is violated, instructions after the DONE, RETRY, or WRPR to PSTATE may
not see the effects of the updated PSTATE.


On UltraSPARC T1, LDBLOCKF does not follow memory model ordering with
respect to stores. In particular, read-after-write and write-after-read hazards to 
overlapping addresses are not detected. The side-effects bit associated with the 
access is ignored (see Translation Table Entry (TTE) on page 53). If ordering with 
respect to earlier stores is important (for example, a block load that overlaps 
previous stores), then there must be an intervening MEMBAR #StoreLoad (or 
stronger MEMBAR). If ordering with respect to later stores is important (for 
example, a block load that overlaps a subsequent store), then there must be an 
intervening MEMBAR #LoadStore or reference to the block load data. This 
restriction does not apply when a trap is taken, so the trap handler need not 
consider pending block loads. If the LDBLOCKF overlaps a previous or later store 
and there is no intervening MEMBAR, trap, or data reference, the LDBLOCKF may 
return data from before or after the store. 


Compatibility | Prior UltraSPARC machines may have written loaded data into the
Note          | first two registers at the same time. Software that depends on this
              | unsupported behavior must be modified for UltraSPARC T1.


STBLOCKF does not follow memory model ordering with respect to loads, stores or 
flushes. In particular, read-after-write, write-after-write, flush-after-write, and write- 
after-read hazards to overlapping addresses are not detected. The side-effects bit 
associated with the access is ignored. If ordering with respect to earlier or later loads 
or stores is important, then there must be an intervening reference to the load data 
(for earlier loads), or appropriate MEMBAR instruction. This restriction does not 
apply when a trap is taken, so the trap handler does not have to worry about 
pending block stores. If the STBLOCKF overlaps a previous load and there is no 
intervening load data reference or MEMBAR #LoadStore instruction, the load may 
return data from before or after the store and the contents of the block are 
undefined. If the STBLOCKF overlaps a later load and there is no intervening trap or 
MEMBAR #StoreLoad instruction, the contents of the block are undefined. If the 
STBLOCKF overlaps a later store or flush and there is no intervening trap or 
MEMBAR #StoreStore instruction, the contents of the block are undefined. 


Block load and store operations do not obey the ordering restrictions of the currently 
selected virtual processor memory model (always TSO in UltraSPARC T1); block 
operations always execute under an RMO memory ordering model. Explicit 
MEMBAR instructions are required to order block operations among themselves or 
with respect to normal loads and stores. In addition, block operations do not 
conform to dependence order on the issuing strand; that is, no read-after-write or 
write-after-read checking occurs between block loads and stores. Explicit
MEMBARs must be used to enforce dependence ordering between block operations 
that reference the same address. 


Typically, LDBLOCKF and STBLOCKF are used in loops where software can ensure 
that there is no overlap between the data being loaded and the data being stored. 
The loop must be preceded and followed by the appropriate MEMBARs to ensure 
that there are no hazards with loads and stores outside the loops. CODE EXAMPLE 5-1 
illustrates the inner loop of a byte-aligned block copy operation. 


Note that the loop must be unrolled twice to achieve maximum performance. All FP 
register references in this code example are to 64-bit registers. Eight versions of this 
loop are needed to handle all the cases of double word misalignment between the 
source and destination. 


CODE EXAMPLE 5-1 Byte-Aligned Block Copy Inner Loop

loop:
        faligndata %f0, %f2, %f34
        faligndata %f2, %f4, %f36
        faligndata %f4, %f6, %f38
        faligndata %f6, %f8, %f40
        faligndata %f8, %f10, %f42
        faligndata %f10, %f12, %f44
        faligndata %f12, %f14, %f46
        addcc      %l0, -1, %l0
        bg,pt      l1
        fmovd      %f14, %f48

        (end of loop handling)

l1:     ldda       [regaddr] #ASI_BLK_P, %f0
        stda       %f32, [regaddr] #ASI_BLK_P
        faligndata %f48, %f16, %f32
        faligndata %f16, %f18, %f34
        faligndata %f18, %f20, %f36
        faligndata %f20, %f22, %f38
        faligndata %f22, %f24, %f40
        faligndata %f24, %f26, %f42
        faligndata %f26, %f28, %f44
        faligndata %f28, %f30, %f46
        addcc      %l0, -1, %l0
        be,pnt     done
        fmovd      %f30, %f48
        ldda       [regaddr] #ASI_BLK_P, %f16
        stda       %f32, [regaddr] #ASI_BLK_P
        ba         loop
        faligndata %f48, %f0, %f32

done:   (end of loop processing)
5.9 Block Initializing Store ASIs


The Block Initializing Store ASIs are specific to the UltraSPARC T1
implementation and are not guaranteed to be portable to other UltraSPARC
Architecture implementations. They should only appear in platform-specific
dynamically-linked libraries or in code generated at runtime by software (for
example, a just-in-time compiler) that is aware of the specific implementation
upon which it is executing.








                                  ASI
Instruction      imm_asi          Value   Operation
ST{B,H,W,X,D}A   ASI_STBI_AIUP    22₁₆    64-byte block initializing store to primary address space, user privilege
ST{B,H,W,X,D}A   ASI_STBI_AIUS    23₁₆    64-byte block initializing store to secondary address space, user privilege
ST{B,H,W,X,D}A   ASI_STBI_N       27₁₆    64-byte block initializing store to nucleus address space
ST{B,H,W,X,D}A   ASI_STBI_AIUPL   2A₁₆    64-byte block initializing store to primary address space, user privilege, little-endian
ST{B,H,W,X,D}A   ASI_STBI_AIUSL   2B₁₆    64-byte block initializing store to secondary address space, user privilege, little-endian
ST{B,H,W,X,D}A   ASI_STBI_NL      2F₁₆    64-byte block initializing store to nucleus address space, little-endian
ST{B,H,W,X,D}A   ASI_STBI_P       E2₁₆    64-byte block initializing store to primary address space
ST{B,H,W,X,D}A   ASI_STBI_S       E3₁₆    64-byte block initializing store to secondary address space
ST{B,H,W,X,D}A   ASI_STBI_PL      EA₁₆    64-byte block initializing store to primary address space, little-endian
ST{B,H,W,X,D}A   ASI_STBI_SL      EB₁₆    64-byte block initializing store to secondary address space, little-endian

Assembly Language Syntax
    st{b,h,w,x,d}a  regrd, [reg_addr] imm_asi
    st{b,h,w,x,d}a  regrd, [reg_plus_imm] %asi








Description 


The UltraSPARC T1-specific block initializing store instructions are selected by using 
one of the block-initializing ASIs with integer store alternate instructions. These ASIs 
allow block-initializing stores to be performed to the same address spaces as normal 
stores. Little-endian ASIs access data in little-endian format; otherwise, the access is 
assumed to be big-endian. 


Integer stores of all sizes are allowed with these ASIs, and STDA behaves as a 
standard store doubleword. All stores to these ASIs operate under relaxed memory 
ordering (RMO), regardless of the value of PSTATE.mm. Software must follow a 
sequence of these stores with a MEMBAR #Sync to ensure ordering with respect to 
subsequent loads and stores. 


A store to one of these ASIs where the least-significant 6 bits of the address are 
nonzero (that is, not the first word in the cache line) behaves the same as a normal 
store (with RMO ordering). 


A store to one of these ASIs where the least-significant 6 bits of the address are zero 
will load a cache line in the L2 cache with either all zeros or the existing memory 
data, and then update the beginning of the cache line with the new store data. This 
special store behavior ensures that the line maintains coherency when it is loaded 
into the cache, but will not generally fetch the line from memory (instead, 
initializing it with zeroes). 


A store using one of these ASIs to a noncacheable location behaves the same as a 
normal store. 
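
The three cases described above can be summarized in a small Python model. This is only an illustrative sketch of the documented behavior, not a description of the hardware; the function name `stbi_effect`, its string results, and the `cacheable` flag are hypothetical names introduced for this example.

```python
LINE_SIZE = 64  # bytes; the block-initializing ASIs operate on 64-byte lines


def stbi_effect(addr, cacheable=True):
    """Classify a store issued through a block-initializing ASI.

    Hypothetical helper: returns a short label for the behavior the
    text describes for the given address.
    """
    if not cacheable:
        # Noncacheable locations: behaves the same as a normal store.
        return "normal store"
    if addr & (LINE_SIZE - 1):
        # Low 6 address bits nonzero: normal store, RMO ordering.
        return "normal store (RMO)"
    # Low 6 address bits zero: the line is initialized in L2, then the
    # beginning of the line is updated with the store data.
    return "initialize line, then store"
```

For instance, under these assumptions a store to address 0x1000 initializes the line, while a store to 0x1008 in the same line is treated as an ordinary RMO store.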


UltraSPARC T1 | On UltraSPARC T1, a noncacheable address is identified by . 
Implementation 
Note 


Programming | These instructions are particularly useful in combination with
Note        | load twin extended word instructions for transferring large
            | blocks (more than 256 bytes) of data; for example, in
            | implementing bcopy() and bfill() operations.


UltraSPARC T1  | On UltraSPARC T1, block initializing stores and load twin
Implementation | doublewords from multiple strands are fully overlapped.
Note           |





Attempted use of any of these ASIs by a floating-point store alternate instruction 
(STFA, STDFA) causes a data_access_exception exception. 


Access to any of these ASIs by an instruction with misaligned address causes a 
mem_address_not_aligned exception. 


22 UltraSPARC T1 Supplement * Draft D2.0, 17 Mar 2006 


Programming | The following pseudocode shows how these ASIs can be used to
Note        | do a quadword-aligned (on both source and destination) copy of
            | N quadwords from A to B (where N > 3). Note that the final 64
            | bytes of the copy are performed using normal stores, to guarantee
            | that all initial zeros in a cache line are overwritten with copy
            | data.
            |
            |   %l0 ← [A]; %l1 ← [B]
            |   prefetch [%l0]
            |   for (i = 0; i < N-4; i++) {
            |       if (!(i % 4)) { prefetch [%l0+64] }
            |       ldda [%l0] #ASI_BLK_INIT_ST_P, %l2
            |       add  %l0, 16, %l0
            |       stxa %l2, [%l1] #ASI_BLK_INIT_ST_P
            |       add  %l1, 8, %l1
            |       stxa %l3, [%l1] #ASI_BLK_INIT_ST_P
            |       add  %l1, 8, %l1
            |   }
            |   for (i = 0; i < 4; i++) {
            |       ldda [%l0] #ASI_BLK_INIT_ST_P, %l2
            |       add  %l0, 16, %l0
            |       stx  %l2, [%l1]
            |       stx  %l3, [%l1+8]
            |       add  %l1, 16, %l1
            |   }
            |   membar #Sync
            |
            | An overlapped copy operation must avoid issuing a block-init
            | store to a line before all loads from that line have been issued.
            | Otherwise, one or more of the loads may see the interim "zero"
            | side-effect value. This typically means that abs(A-B) must be
            | at least 64.

UltraSPARC T1 | (1) These ASIs are specific to UltraSPARC T1, to provide a high-
Programming   | performance mechanism for BCOPY operations, as an
Notes         | alternative to legacy block load and block store instructions
              | (which rely on the floating-point register file and thus are
              | limited by the single register file port). These ASIs are only
              | allowed in platform-specific dynamically linked libraries and
              | in code generated at runtime by software (for example, a
              | just-in-time compiler) that is aware of the implementation upon
              | which it is executing.
              |
              | (2) These ASIs provide a higher performance bcopy() or
              | bfill() than the block loads and stores described in
              | Section 5.8, due to their ability to overlap multiple loads and
              | stores between strands and to avoid the unnecessary fetch from
              | memory of the data that is overwritten by the store. The
              | performance of a parallel bcopy() using these ASIs will be 2.5 to
              | 3.5 times that of a bcopy() using block loads and stores. The
              | performance of a single-threaded bcopy() using these ASIs will
              | be 15% to 50% better than that of a bcopy() using block loads
              | and stores.

Exceptions      VA_watchpoint
                mem_address_not_aligned
                data_access_exception
5.10 Load Twin Extended Word Instructions
(nonprivileged)


The Load Twin Extended Word Instructions are not guaranteed to be portable to
other UltraSPARC Architecture implementations. They should only appear in
platform-specific dynamically-linked libraries or in code generated at runtime
by software (for example, a just-in-time compiler) that is aware of the specific
implementation upon which it is executing.





Description

Load Twin Extended Word instructions are new in the UltraSPARC Architecture
2005; they are used to atomically read a 128-bit data item into a pair of integer
registers.


See the UltraSPARC Architecture 2005 for details. 


Programming | These instructions are particularly useful in combination with
Note        | block-initializing stores for transferring large blocks of data
            | (more than 256 bytes); for example, in implementing bcopy()
            | and bfill() operations. See the description of Block Initializing
            | Stores for an example of how Load Twin Extended Word can be
            | used in combination with those instructions.


UltraSPARC T1  | On UltraSPARC T1, a load twin extended word forces a miss in
Implementation | the primary cache and will not allocate a line in the primary
Note           | cache, but does allocate in L2. On UltraSPARC T1, block
               | initializing stores and load twin doublewords from multiple
               | strands are fully overlapped.





See the description of Block Initializing Stores for an example of how Load Twin 
Doubleword can be used in combination with those instructions. 


UltraSPARC T1 | (1) These instructions, combined with store instructions using
Programming   | the UltraSPARC T1-specific Block Initializing Store ASIs,
Notes         | provide a high-performance mechanism for BCOPY operations,
              | as an alternative to legacy block load and store (which rely on
              | the floating-point register file and thus are limited by the single
              | register file port). These ASIs are only allowed in
              | platform-specific dynamically linked libraries and in code generated at
              | runtime by software (for example, a just-in-time compiler) that
              | is aware of the implementation upon which it is executing.
              |
              | (2) These ASIs provide a higher performance bcopy() or
              | bfill() than the block loads and stores described in
              | Section 5.8, due to their ability to overlap multiple loads and
              | stores between strands and to avoid the unnecessary fetch from
              | memory of the data that is overwritten by the store. The
              | performance of a parallel bcopy() using these ASIs will be 2.5 to
              | 3.5 times that of a bcopy() using block loads and stores. The
              | performance of a single-threaded bcopy() using these ASIs will
              | be 15% to 50% better than that of a bcopy() using block loads
              | and stores.





See Also Block Initializing Store ASIs on page 21. 
5.11 Load Twin Extended Word Instructions
(privileged)








                                 ASI
Instruction   imm_asi            Value    Operation
LDTX          ASI_LDTX_N         27₁₆*    128-bit atomic load
LDTX          ASI_LDTX_REAL      26₁₆     128-bit atomic load, real addressing (RA{63:0} set to VA{63:0})
LDTX          ASI_LDTX_NL        2F₁₆†    128-bit atomic load, little endian
LDTX          ASI_LDTX_REAL_L    2E₁₆     128-bit atomic load, real addressing (RA{63:0} set to VA{63:0}), little endian

Assembly Language Syntax
    ldda [reg_addr] imm_asi, regrd
    ldda [reg_plus_imm] %asi, regrd

* ASI 24₁₆ (deprecated) is aliased to ASI 27₁₆ in UltraSPARC T1.
† ASI 2C₁₆ (deprecated) is aliased to ASI 2F₁₆ in UltraSPARC T1.


Description 


[Instruction format diagram; field boundaries at bits 31:30, 29:25, 24:19, 18:14, 13, 12:5, and 4:0]


Compatibility | In previous UltraSPARC documents, these instructions were
Note          | (loosely) referred to as "Quad LDD" instructions.


These instructions atomically read a 128-bit data item into two 64-bit integer 
registers. They are intended to be used to access TSB entries without requiring locks. 
The data is placed in an even/odd pair of 64-bit integer registers. The lowest address 
64 bits is placed in the even-numbered register; the highest address 64-bits is placed 
in the odd-numbered register. 
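
The register-pair convention can be illustrated with a short Python model of a big-endian 128-bit load. This is a sketch of the data placement only, under the assumption that memory is modeled as a `bytes` object; the name `ldtx_model` is hypothetical, and the model does not attempt to represent atomicity, translation, or the little-endian ASIs.

```python
def ldtx_model(memory, addr):
    """Return (even_reg, odd_reg) for a 128-bit load at addr.

    Hypothetical model: 'memory' is a bytes object and 'addr' an index
    into it. Big-endian byte order is assumed.
    """
    if addr % 16:
        # The access must be aligned on a 128-bit (16-byte) boundary.
        raise ValueError("mem_address_not_aligned")
    # Lowest-addressed 64 bits go to the even-numbered register.
    even = int.from_bytes(memory[addr:addr + 8], "big")
    # Highest-addressed 64 bits go to the odd-numbered register.
    odd = int.from_bytes(memory[addr + 8:addr + 16], "big")
    return even, odd
```

In this model, both halves come from a single 16-byte slice, which mirrors the intent that a TSB entry's two 64-bit words are read together.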





ASI_LDTX_REAL[_L] bypasses the virtual-to-real portion of the translation, setting
RA{63:0} = VA{63:0}.


In addition to the usual exceptions for LDTX using a privileged ASI, a
data_access_exception trap occurs if these ASIs are used with any instruction other
than LDTX or LDDA (which share an opcode). A mem_address_not_aligned trap is
taken if the access is not aligned on a 128-bit boundary.


Exceptions      VA_watchpoint
                mem_address_not_aligned (Checked for opcode implied alignment if the opcode
                is not LDDA)
                data_access_exception
CHAPTER 6


Traps 





The UltraSPARC T1 processor implements the trap model described in the 
UltraSPARC Architecture 2005. 


Additional UltraSPARC T1-specific details are described in this chapter. 





6.1 Trap Levels 


Each UltraSPARC T1 virtual processor supports two trap levels (MAXPTL = 2).
CHAPTER 7


Interrupt Handling 





7.0.1 Interrupt Queue Registers





Each strand has eight ASI_QUEUE registers at ASI 25₁₆, VA{63:0} = 3C0₁₆-3F8₁₆ that
are used for communicating interrupts to the privileged mode operating system.
These registers contain the head and tail pointers for four supervisor interrupt
queues: cpu_mondo, dev_mondo, resumable_error, and nonresumable_error.











The tail registers are read-only. An attempted write to a tail register by privileged
software generates a data_access_exception trap. The head registers are read/write.


Whenever the contents of the CPU_MONDO_HEAD and CPU_MONDO_TAIL
registers are unequal, a cpu_mondo trap is generated. Whenever the contents of the
DEV_MONDO_HEAD and DEV_MONDO_TAIL registers are unequal, a dev_mondo
trap is generated. Whenever the contents of the RESUMABLE_ERROR_HEAD and
RESUMABLE_ERROR_TAIL registers are unequal, a resumable_error trap is
generated.


Unlike the other queue register pairs, the nonresumable_error trap is not
automatically generated by hardware whenever the contents of the
NONRESUMABLE_ERROR_HEAD and NONRESUMABLE_ERROR_TAIL registers
are unequal; instead, hyperprivileged software must make it appear to privileged
software as if a nonresumable_error trap has occurred.


Warning | There is a known "feature" in UltraSPARC T1 that affects
        | LDXA/STXA by supervisor code to these ASI registers. If the
        | immediately preceding instruction is a store that takes certain
        | traps, an LDXA can corrupt an unrelated IRF (integer register
        | file) register, or an STXA may complete in spite of the trap. To
        | prevent this, a non-store or NOP instruction is required
        | before any LDXA/STXA to these ASIs. If the LDXA/STXA is at
        | a branch target, there must be a non-store in the delay slot.
        | Nonprivileged software is not affected by this.
Programming | These registers are intended to be used as head and tail pointers
Note        | into a queue in memory storing the mondo or error interrupt
            | data. When an interrupt is taken, the interrupt data are stored
            | at the end of the appropriate queue. Then the corresponding
            | tail register is updated to point beyond the new data, which
            | causes a trap to be generated to privileged software (the
            | operating system). Privileged software then processes the
            | interrupt data from the head of the queue, updating the head
            | register when the interrupt processing is completed.
            |
            | While the first interrupt is being serviced, more interrupts may
            | be placed on the queue. The operating system can read the tail
            | pointer to service multiple interrupts at a time, or it can simply
            | update the head pointer after each interrupt has been serviced
            | and take a trap for each interrupt.
            |
            | When all pending interrupts of the appropriate type have been
            | serviced, the head and tail pointers will be equal again, and no
            | further traps will be generated until new interrupt data is placed
            | on the queue.
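
The head/tail protocol in the note above can be sketched in Python. This is a minimal model under stated assumptions, not the operating system's actual handler: it assumes the pointers are byte offsets into the queue, that entries are 64 bytes (suggested by the reserved low 6 bits of the registers), and that offsets wrap at the queue size. The names `trap_pending`, `service_queue`, `queue_bytes`, and `handle` are hypothetical.

```python
ENTRY_SIZE = 64        # assumption: one queue entry per 64-byte step
OFFSET_MASK = 0x3FC0   # bits 13:6, the head/tail field of the registers


def trap_pending(head, tail):
    """A trap is generated whenever head and tail are unequal."""
    return head != tail


def service_queue(head, tail, queue_bytes, handle):
    """Process every pending entry and return the new head offset.

    'handle' is called with the offset of each entry in order. The
    caller would then write the returned value to the head register.
    """
    while trap_pending(head, tail):
        handle(head)
        # Advance to the next entry, wrapping at the end of the queue.
        head = ((head + ENTRY_SIZE) % queue_bytes) & OFFSET_MASK
    return head
```

Servicing all entries up to the tail in one pass corresponds to the first strategy the note describes; updating the head after each single entry and returning would instead cause one trap per interrupt.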


TABLE 7-1 through TABLE 7-8 define the format of the eight interrupt queue registers.

TABLE 7-1 CPU Mondo Head Pointer - QUEUE_CPU_MONDO_HEAD (ASI 25₁₆, VA 3C0₁₆)

Bit     Field   R/W   Description
63:14   -       R     Reserved
13:6    head    RW    Head pointer for CPU Mondo Interrupt Queue.
5:0     -       R     Reserved




TABLE 7-2 CPU Mondo Tail Pointer - QUEUE_CPU_MONDO_TAIL (ASI 25₁₆, VA 3C8₁₆)

Bit     Field   R/W   Description
63:14   -       R     Reserved
13:6    tail    RW    Tail pointer for CPU Mondo Interrupt Queue.
5:0     -       R     Reserved


TABLE 7-3 Device Mondo Head Pointer - QUEUE_DEV_MONDO_HEAD (ASI 25₁₆, VA 3D0₁₆)

Bit     Field   R/W   Description
63:14   -       R     Reserved
13:6    head    RW    Head pointer for Device Mondo Interrupt Queue.
5:0     -       R     Reserved
TABLE 7-4 Device Mondo Tail Pointer - QUEUE_DEV_MONDO_TAIL (ASI 25₁₆, VA 3D8₁₆)

Bit     Field   R/W   Description
63:14   -       R     Reserved
13:6    tail    RW    Tail pointer for Device Mondo Interrupt Queue.
5:0     -       R     Reserved





TABLE 7-5 Resumable Error Head Pointer - QUEUE_RESUMABLE_HEAD (ASI 25₁₆, VA 3E0₁₆)

Bit     Field   R/W   Description
63:14   -       R     Reserved
13:6    head    RW    Head pointer for Resumable Error Queue.
5:0     -       R     Reserved


TABLE 7-6 Resumable Error Tail Pointer - QUEUE_RESUMABLE_TAIL (ASI 25₁₆, VA 3E8₁₆)

                Initial
Bit     Field   Value     R/W   Description
63:14   -                 R     Reserved
13:6    tail              RW    Tail pointer for Resumable Error Queue.
5:0     -                 R     Reserved


TABLE 7-7 Nonresumable Error Head Pointer - QUEUE_NONRESUMABLE_HEAD (ASI 25₁₆, VA 3F0₁₆)

                Initial
Bit     Field   Value     R/W   Description
63:14   -                 R     Reserved
13:6    head              RW    Head pointer for NonResumable Error Queue.
5:0     -                 R     Reserved


TABLE 7-8 Nonresumable Error Tail Pointer - QUEUE_NONRESUMABLE_TAIL (ASI 25₁₆, VA 3F8₁₆)

Bit     Field   R/W   Description
63:14   -       R     Reserved
13:6    tail    RW    Tail pointer for NonResumable Error Queue.
5:0     -       R     Reserved







CHAPTER 8 


Memory Models 








8.1 Overview 


SPARC V9 defines the semantics of memory operations for three memory models. 
From strongest to weakest, they are Total Store Order (TSO), Partial Store Order 
(PSO), and Relaxed Memory Order (RMO). The differences in these models lie in the 
freedom an implementation is allowed in order to obtain higher performance during 
program execution. The purpose of the memory models is to specify any constraints 
placed on the ordering of memory operations in uniprocessor and shared-memory 
multiprocessor environments. 


For a full description of the TSO memory model, see the UltraSPARC Architecture 
2005. 


UltraSPARC T1 supports only TSO, with the exception that accesses using certain 
ASIs (notably, block loads and block stores) may operate under RMO (impl. dep. 
#113-V9-Ms10). 


Although a program written for a weaker memory model potentially benefits from 
higher execution rates, it may require explicit memory synchronization instructions 
to function correctly if data is shared. MEMBAR is a memory synchronization 
primitive that enables a programmer to control explicitly the ordering in a sequence 
of memory operations. Processor consistency is guaranteed in all memory models. 


The current memory model is indicated in the PSTATE.mm field. Its value is always 
0 on UltraSPARC T1. An UltraSPARC T1 virtual processor always operates under the 
TSO memory model. 


Memory is logically divided into real memory (cached) and I/O memory 
(noncached, with and without side effects) spaces (impl. dep. #118-V9). Real memory 
spaces may be cached and can be accessed without side effects. For example, a read 
(load) from real memory space returns the information most recently written. In 
addition, an access to real memory space does not result in program-visible side
effects. In contrast, a read from I/O space may not return the most recently written 
information and may result in program-visible side effects. 





8.2 Supported Memory Models 


The following sections contain brief descriptions of the two memory models 
supported by UltraSPARC T1. These definitions are for general illustration. Detailed 
definitions of these models can be found in UltraSPARC Architecture 2005. The 
definitions in the following sections apply to system behavior as seen by the 
programmer. A description of MEMBAR can be found in Section 8.3.2, "Memory
Synchronization: MEMBAR and FLUSH" on page 72. 


Notes | (1) Stores to UltraSPARC T1 Internal ASIs, block loads, and 
block stores are outside the memory model; that is, they need 
MEMBARs to control ordering. See Section 8.3.8, “Instruction 
Prefetch to Side-Effect Locations" on page 79 and Section 13.5.3, 
“Block Load and Store Instructions" on page 172. 


(2) Atomic load-stores are treated as both a load and a store and 
can only be applied to cacheable address spaces. 





8.2.1 Total Store Order 


UltraSPARC T1 implements the following programmer-visible properties in Total 
Store Order (TSO) mode: 


■ Loads are processed in program order; that is, there is an implicit MEMBAR
  #LoadLoad between them.

■ Loads may bypass earlier stores. Any such load that bypasses such earlier stores
  must check (snoop) the store buffer for the most recent store to that address. A
  MEMBAR #Lookaside is not needed between a store and a subsequent load at
  the same noncacheable address.

■ A MEMBAR #StoreLoad must be used to prevent a load from bypassing a prior
  store, if Strong Sequential Order is desired.

■ Stores are processed in program order.

■ Stores cannot bypass earlier loads.

■ An L2 cache update is delayed on a store hit until all outstanding stores reach
  global visibility. For example, a cacheable store following a noncacheable store is
  not globally visible until the noncacheable store has reached global visibility;
  there is an implicit MEMBAR #MemIssue between them.
8.2.2 Relaxed Memory Order


UltraSPARC T1 implements the following programmer-visible properties for 
accesses through special ASIs that operate under the Relaxed Memory Order (RMO) 
model: 


■ There is no implicit order between any two memory references, either cacheable
  or noncacheable, except that noncacheable accesses to I/O space are all strongly
  ordered with respect to each other.

■ A MEMBAR must be used between cacheable memory references if stronger
  order is desired. A MEMBAR #MemIssue is needed for ordering of cacheable
  after noncacheable accesses. A MEMBAR #StoreLoad should be used between
  a store and a subsequent load at the same noncacheable address.
CHAPTER 9


Address Spaces and ASIs


9.1 Address Spaces


UltraSPARC T1 supports a 48-bit virtual address space.


9.1.1 Access to Nonexistent Memory or I/O


Accesses to nonexistent memory or I/O locations are treated as follows:

■ A load access from a nonexistent memory or I/O location causes an exception.

■ An instruction fetch from a nonexistent memory or I/O location causes an
  exception.

■ A store access to a nonexistent memory or I/O location will be silently discarded
  by the system.


9.1.2 48-bit Virtual Address Space


UltraSPARC T1 supports a 48-bit subset of the full 64-bit virtual address space (see 
FIGURE 9-1). Although the full 64 bits are generated and stored in integer registers, 
legal addresses are restricted to two equal halves at the extreme lower and upper 
portions of the full virtual address space. Virtual addresses between 

0000 8000 0000 0000₁₆ and FFFF 7FFF FFFF FFFF₁₆, inclusive, lie within a "VA Hole",
are termed "out of range," and are illegal.
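
As a concrete illustration of the range just described, the following Python sketch tests whether a 64-bit virtual address falls in the VA hole. The function name `in_va_hole` and the constant names are hypothetical; only the two boundary values come from the text.

```python
VA_HOLE_LOW = 0x0000_8000_0000_0000   # first out-of-range address
VA_HOLE_HIGH = 0xFFFF_7FFF_FFFF_FFFF  # last out-of-range address


def in_va_hole(va):
    """Return True if the 64-bit virtual address is out of range."""
    va &= (1 << 64) - 1  # treat the value as an unsigned 64-bit quantity
    return VA_HOLE_LOW <= va <= VA_HOLE_HIGH
```

The check is inclusive at both ends, matching the wording above: the highest legal address in the lower half is 0000 7FFF FFFF FFFF and the lowest legal address in the upper half is FFFF 8000 0000 0000.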


Prior UltraSPARC implementations introduced the additional restriction on software 
to not use pages within 4 Gbytes of the VA hole as instruction pages, to avoid 
problems with prefetching into the VA hole. UltraSPARC T1 assumes that this
convention is followed, for similar reasons. Note that there are no trap mechanisms 
to detect a violation of this convention. 
FFFF FFFF FFFF FFFF  +----------------------+
                     |                      |
FFFF 8001 0000 0000  +----------------------+
                     |  See Note (1)        |
FFFF 8000 0000 0000  +----------------------+
FFFF 7FFF FFFF FFFF  |                      |
                     |  Out-of-Range VA     |
                     |  ("VA Hole")         |
0000 8000 0000 0000  |                      |
0000 7FFF FFFF FFFF  +----------------------+
                     |  See Note (1)        |
0000 7FFE FFFF FFFF  +----------------------+
                     |                      |
0000 0000 0000 0000  +----------------------+

Note (1): Prior implementations restricted use of this region to data only.


FIGURE 9-1 UltraSPARC T1's 48-bit Virtual Address Space, With Hole


Note | Throughout this document, when virtual address fields are
     | specified as 64-bit quantities, bits 63:48 are assumed to be
     | sign-extended from bit 47.
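
The sign-extension convention can be illustrated with a short Python sketch. The helper name `sign_extend_48` is hypothetical; it simply copies bit 47 into bits 63:48 of a 64-bit value.

```python
def sign_extend_48(va):
    """Sign-extend a 48-bit virtual address to 64 bits from bit 47."""
    low = va & ((1 << 48) - 1)   # keep bits 47:0
    if low & (1 << 47):          # bit 47 set: fill bits 63:48 with ones
        low |= 0xFFFF << 48
    return low
```

A consequence of this convention is that a sign-extended 48-bit value always lands in one of the two legal halves of the address space, never inside the VA hole.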





A number of state registers are affected by the reduced virtual address space. TBA, 
TPC, and TNPC registers are 48 bits wide, sign-extended to 64 bits on read accesses. 
VA watchpointing is 48 bits, zero-extended to 64-bits on read accesses. No checks are 
done when these registers are written by software. It is the responsibility of 
privileged software to properly update these registers. 


An out-of-range address during an instruction access causes an
instruction_access_exception trap if PSTATE.am = 0.


If the target address of a JMPL or RETURN instruction is an out-of-range address
and PSTATE.am is not set, a trap is generated with TPC[TL] set to the address of the
JMPL or RETURN instruction. This instruction_access_exception trap is lower priority
than other traps on the JMPL or RETURN (illegal_instruction due to nonzero reserved
fields in the JMPL or RETURN, mem_address_not_aligned trap, or window fill trap),
because it really applies to the target. The trap handler can determine the
out-of-range address by decoding the JMPL instruction from the code.


When any other control transfer instruction traps, it sets TPC[TL] to the address of
the target instruction. Because the PC is sign-extended to 64 bits, the trap handler
must adjust the PC value to compute the faulting address by XORing ones into the
most significant 16 bits.
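
The adjustment described above amounts to a single XOR. The following Python sketch shows the arithmetic on a 64-bit value; the function name `faulting_address` is hypothetical, and the sketch models only the computation, not the trap handler itself.

```python
UPPER_16_ONES = 0xFFFF << 48  # ones in bits 63:48


def faulting_address(tpc):
    """Recover an out-of-range target address from a sign-extended TPC.

    Assumes 'tpc' is the 64-bit value read from TPC[TL] after a trap on
    a control transfer to an out-of-range target.
    """
    return (tpc ^ UPPER_16_ONES) & ((1 << 64) - 1)
```

For example, a target of 0000 8000 0000 0000 (the first address in the VA hole) is held as 48 bits with bit 47 set, so it reads back sign-extended as FFFF 8000 0000 0000; flipping the upper 16 bits restores the original out-of-range address.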
When a trap occurs on the delay slot of a taken branch or call whose target is out-of-
range or is the last instruction below the VA hole, UltraSPARC T1 records the fact 
that NPC points to an out-of-range instruction in TNPC. If the trap handler executes 
a DONE or RETRY without saving TNPC, the instruction access exception trap is 
taken when the instruction at TNPC is executed. If TNPC is saved and subsequently 
restored by the trap handler, the fact that TNPC points to an out-of-range instruction 
is lost. 


To guarantee that all out-of-range instruction accesses cause traps, software should
not map addresses within 2^31 bytes of either side of the VA hole as executable.


An out-of-range address during a data access results in a data_access_exception trap
if PSTATE.am is not set.





9.2 Alternate Address Spaces


TABLE 9-1 summarizes the ASI usage in UltraSPARC T1. The Section column lists
where the operation of the ASI is explained. For several internal ASIs, a range of 
legal VAs is listed. An access outside the legal VA range will be aliased to a legal VA 
by ignoring the upper address bits. 


Notes | (1) All internal, nontranslating ASIs in UltraSPARC T1 can only
      | be accessed using LDXA and STXA. This differs from
      | UltraSPARC I/II, where LDDFA and STDFA can also be used to
      | access internal ASIs. Using LDDFA and STDFA to access an
      | internal ASI in UltraSPARC T1 results in a
      | data_access_exception trap.
      |
      | (2) ASIs 80₁₆-FF₁₆ are unrestricted (nonprivileged and
      | privileged software may access them). ASIs 00₁₆-2F₁₆ are restricted to
      | privileged software.
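
The two ranges in note (2) can be expressed as a small lookup. This Python sketch covers only what the note states; the function name `asi_access_class` is hypothetical, and ASIs outside the two stated ranges are reported as `None` because this passage does not classify them.

```python
def asi_access_class(asi):
    """Classify an 8-bit ASI number according to note (2)."""
    if not 0 <= asi <= 0xFF:
        raise ValueError("ASI must be an 8-bit value")
    if asi <= 0x2F:
        return "privileged"      # 00-2F: restricted to privileged software
    if asi >= 0x80:
        return "unrestricted"    # 80-FF: nonprivileged and privileged
    return None                  # 30-7F: not covered by this note
```

For example, ASI_QUEUE (25₁₆) falls in the privileged range, while ASI_STBI_P (E2₁₆) is unrestricted.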











TABLE 9-1 UltraSPARC T1 ASI Usage (1 of 5)

                                                  Copy per
ASI          ASI Name              R/W   VA    strand     Description                          Section
00₁₆-03₁₆                                Any   -          data_access_exception
04₁₆         ASI_NUCLEUS           RW    Any   -          (See UltraSPARC Architecture 2005)
05₁₆-0B₁₆                                Any   -          data_access_exception
0C₁₆         ASI_NUCLEUS_LITTLE    RW    Any   -          (See UltraSPARC Architecture 2005)





TABLE 9-1 UltraSPARC T1 ASI Usage (2 of 5)

                                                              Copy per
ASI          ASI Name                                R/W  VA  strand    Description                                       Section
0D₁₆-0F₁₆                                                               data_access_exception
10₁₆         ASI_AS_IF_USER_PRIMARY                  RW                 (See UltraSPARC Architecture 2005)
11₁₆         ASI_AS_IF_USER_SECONDARY                RW                 (See UltraSPARC Architecture 2005)
12₁₆-13₁₆                                                               data_access_exception
14₁₆         ASI_REAL                                RW                 (See UltraSPARC Architecture 2005)                9.2.1
15₁₆         ASI_REAL_IO                             RW                 (See UltraSPARC Architecture 2005)                9.2.2
16₁₆         ASI_BLOCK_AS_IF_USER_PRIMARY            RW                 (See UltraSPARC Architecture 2005)                5.8
17₁₆         ASI_BLOCK_AS_IF_USER_SECONDARY          RW                 (See UltraSPARC Architecture 2005)                5.8
18₁₆         ASI_AS_IF_USER_PRIMARY_LITTLE           RW                 (See UltraSPARC Architecture 2005)
19₁₆         ASI_AS_IF_USER_SECONDARY_LITTLE         RW                 (See UltraSPARC Architecture 2005)
1A₁₆-1B₁₆                                                               data_access_exception
1C₁₆         ASI_REAL_LITTLE                         RW                 Nonallocating in L1 cache, same as                9.2.1
                                                                        ASI_REAL_IO_LITTLE for I/O addresses
1D₁₆         ASI_REAL_IO_LITTLE                      RW                 (See UltraSPARC Architecture 2005)                9.2.2
1E₁₆         ASI_BLOCK_AS_IF_USER_PRIMARY_LITTLE     RW                 (See UltraSPARC Architecture 2005)                5.8
1F₁₆         ASI_BLOCK_AS_IF_USER_SECONDARY_LITTLE   RW                 (See UltraSPARC Architecture 2005)                5.8
20₁₆         ASI_SCRATCHPAD                          RW                 Scratchpad registers 0-3                          9.2.3
                                                                        data_access_exception                             9.2.3
                                                     RW                 Scratchpad registers 6-7                          9.2.3
21₁₆         ASI_MMU_CONTEXTID                       RW                 (See UltraSPARC Architecture 2005)
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TABLE 9-1 UltraSPARC T1 ASI Usage (3 of 5) 
ASI         ASI NAME 
22₁₆        ASI_LDTX_AIUP, ASI_STBI_AIUP 
23₁₆        ASI_LDTX_AIUS, ASI_STBI_AIUS 
24₁₆        ASI_TWINX (ASI_LDTX), ASI_QUAD_LDD†, ASI_NUCLEUS_QUAD_LDD† 
25₁₆        ASI_QUEUE 
26₁₆        ASI_LDTX_REAL 
27₁₆        ASI_LDTX_N, ASI_STBI_N 
28₁₆-29₁₆ 
2A₁₆        ASI_LDTX_AIUP_L, ASI_STBI_AIUP_L 
2B₁₆        ASI_LDTX_AIUS_L, ASI_STBI_AIUS_L 
2C₁₆        ASI_TWINX_LITTLE (ASI_LDTX_L), ASI_QUAD_LDD_LITTLE†, 
            ASI_NUCLEUS_QUAD_LDD_LITTLE† 





R/W 
RW 


RW 


RW 


RW 


RW 


RW 


RW 


R 


VA 
Any 


Copy 
per 


strand Description 


Section 


(See UltraSPARC Architecture 2005)   5.10 
ASI_STBI_AIUP is used for Block-Initializing stores, As If User, Primary Context 

(See UltraSPARC Architecture 2005) 
ASI_STBI_AIUS is used for Block-Initializing stores, As If User, Secondary Context 

128-bit atomic Load Twin Doubleword (deprecated; superseded by ASI 27₁₆) 

Load/store does NOP 

5.10 

(See UltraSPARC Architecture 2005) 

128-bit atomic LDTX, real address (see UltraSPARC Architecture 2005) 

(See UltraSPARC Architecture 2005) 

5.11 

5.10 

ASI_STBI_N is used for Block-Initializing stores, Nucleus Context 

data_access_exception 

(See UltraSPARC Architecture 2005)   5.10 
ASI_STBI_AIUP_L is used for Block-Initializing stores, As If User, Primary Context, 
Little Endian 

(See UltraSPARC Architecture 2005) 
ASI_STBI_AIUS_L is used for Block-Initializing stores, As If User, Secondary Context, 
Little Endian 

128-bit atomic Load Twin Doubleword, little endian (deprecated; superseded by 
ASI 2F₁₆) 

5.10 




TABLE 9-1 UltraSPARC T1 ASI Usage (4 of 5) 

ASI         ASI NAME 
2D₁₆ 
2E₁₆        ASI_LDTX_REAL_L 
2F₁₆        ASI_LDTX_NL, ASI_STBI_NL 
80₁₆        ASI_PRIMARY 
81₁₆        ASI_SECONDARY 
82₁₆        ASI_PRIMARY_NO_FAULT 
83₁₆        ASI_SECONDARY_NO_FAULT 
84₁₆-87₁₆ 
88₁₆        ASI_PRIMARY_LITTLE 
89₁₆        ASI_SECONDARY_LITTLE 
8A₁₆        ASI_PRIMARY_NO_FAULT_LITTLE 
8B₁₆        ASI_SECONDARY_NO_FAULT_LITTLE 
8C₁₆-BF₁₆ 
C0₁₆        ASI_PST8_P 
C1₁₆        ASI_PST8_S 
C2₁₆        ASI_PST16_P 
C3₁₆        ASI_PST16_S 
C4₁₆        ASI_PST32_P 
C5₁₆        ASI_PST32_S 
C6₁₆-C7₁₆ 
C8₁₆        ASI_PST8_PL 
C9₁₆        ASI_PST8_SL 
CA₁₆        ASI_PST16_PL 
CB₁₆        ASI_PST16_SL 











R/W 


RW 


RW 


RW 


RW 


RW 
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VA 
Any 
Any 


Any 


Copy 
per 
strand 


Description 

data_access_exception 
128-bit atomic LDTX, real address, little endian (see UltraSPARC Architecture 2005) 
(See UltraSPARC Architecture 2005) 
ASI_STBI_NL is used for Block-Initializing stores, Nucleus context, Little-Endian 
(See UltraSPARC Architecture 2005) 
(See UltraSPARC Architecture 2005) 
(See UltraSPARC Architecture 2005) 
(See UltraSPARC Architecture 2005) 
data_access_exception 
(See UltraSPARC Architecture 2005) 
(See UltraSPARC Architecture 2005) 
(See UltraSPARC Architecture 2005) 
(See UltraSPARC Architecture 2005) 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 
data_access_exception 


TABLE 9-1 UltraSPARC T1 ASI Usage (5 of 5) 

ASI         ASI NAME           R/W   VA    Copy per strand   Description                                                 Section 
CC₁₆        ASI_PST32_PL             Any   —                 data_access_exception¹ 
CD₁₆        ASI_PST32_SL             Any   —                 data_access_exception¹ 
CE₁₆-CF₁₆                            Any   —                 data_access_exception 
D0₁₆        ASI_FL8_P                Any   —                 data_access_exception² 
D1₁₆        ASI_FL8_S                Any   —                 data_access_exception² 
D2₁₆        ASI_FL16_P               Any   —                 data_access_exception² 
D3₁₆        ASI_FL16_S               Any   —                 data_access_exception² 
D4₁₆-D7₁₆                            Any   —                 data_access_exception 
D8₁₆        ASI_FL8_PL               Any   —                 data_access_exception² 
D9₁₆        ASI_FL8_SL               Any   —                 data_access_exception² 
DA₁₆        ASI_FL16_PL              Any   —                 data_access_exception² 
DB₁₆        ASI_FL16_SL              Any   —                 data_access_exception² 
DC₁₆-DF₁₆                            Any   —                 data_access_exception 
E0₁₆        ASI_BLK_COMMIT_P   RW    Any   —                 data_access_exception³ 
E1₁₆        ASI_BLK_COMMIT_S   RW    Any   —                 data_access_exception³ 
E4₁₆-E9₁₆                            Any   —                 data_access_exception 
EC₁₆-EF₁₆                            Any   —                 data_access_exception 
F0₁₆        ASI_BLK_P          RW    Any   —                 64-byte block load/store, primary address                   5.8 
F1₁₆        ASI_BLK_S          RW    Any   —                 64-byte block load/store, secondary address                 5.8 
F2₁₆-F7₁₆                            Any   —                 data_access_exception 
F8₁₆        ASI_BLK_PL         RW    Any   —                 64-byte block load/store, primary address, little endian    5.8 
F9₁₆        ASI_BLK_SL         RW    Any   —                 64-byte block load/store, secondary address, little endian  5.8 
FA₁₆-FF₁₆                            Any   —                 data_access_exception 





† This ASI name has been changed, for consistency; although use of this name is deprecated and software should use the new name, 
the old name is listed here for compatibility. 


1. ASIs C0₁₆-C5₁₆, C8₁₆-CD₁₆, D0₁₆-D3₁₆, D8₁₆-DB₁₆, and E0₁₆-E1₁₆ are checked for a VA watchpoint and will generate a VA_watchpoint 
trap if the watchpoint conditions are met. They are also checked for word alignment and doubleword alignment on STDFA, and will 
generate a mem_address_not_aligned trap if the effective address (R[rs1] + R[rs2]; note that R[rs2] is not used as a mask) is not 
word-aligned, or a stdf_mem_address_not_aligned trap if the address is word-aligned but not doubleword-aligned. 


2. ASIs D0₁₆-D3₁₆ and D8₁₆-DB₁₆ are checked for a VA watchpoint and will generate a VA_watchpoint trap if the watchpoint conditions 
are met. They are also checked for word alignment and doubleword alignment on STDFA and LDDFA, and will generate a 
mem_address_not_aligned trap if the address is not word-aligned, or a stdf_mem_address_not_aligned/ 
lddf_mem_address_not_aligned trap if the address is word-aligned but not doubleword-aligned. 

3. ASIs E0₁₆-E1₁₆ are checked for a VA watchpoint and will generate a VA_watchpoint trap if the watchpoint conditions are met. They are 
also checked for word alignment and doubleword alignment on STDFA, and will generate a mem_address_not_aligned trap if the 
address is not word-aligned, or a stdf_mem_address_not_aligned trap if the address is word-aligned but not doubleword-aligned. 


9.2.1 ASI_REAL and ASI_REAL_LITTLE 


This ASI is used to bypass the data MMU for memory addresses. Since the cp page 
attribute bit is clear, load accesses using this ASI will always fetch their data from 
the L2 cache. Using this ASI for an I/O address is permitted, and will follow the 
same page attributes (w = 1, all other attributes 0). 





Programming | Although it is permitted to use ASI_REAL (or 
Note | ASI_REAL_LITTLE) for an I/O access, it is not recommended 
to do so because the e bit is not set for the access. ASI_REAL_IO 
and ASI_REAL_IO_LITTLE should be used instead. 


























9.2.2 ASI_REAL_IO and ASI_REAL_IO_LITTLE 


This ASI is used to bypass the data MMU for I/O addresses. The physical page 
attributes e and w are set to 1 and all other attribute bits are set to 0 for accesses to 
this ASI. Using this ASI for a memory address is permitted, and will follow the same 
page attributes (e = 1, w = 1, all other attributes 0). 


Note | An atomic load-store operation is not permitted to these ASIs; 
an attempt to execute one will result in a data_access_exception 
trap. 


9.2.3 ASI_SCRATCHPAD 


Each strand has a set of six privileged ASI_SCRATCHPAD registers, accessed through 
ASI 20₁₆ with VA{63:0} = 0₁₆, 8₁₆, 10₁₆, 18₁₆, 30₁₆, or 38₁₆. These registers are for 
scratchpad use by privileged software. VA 20₁₆ and 28₁₆ may be used to access two 
additional scratchpad registers. However, access to those two scratchpad registers 
will be much slower than to the other six (because accesses to them will cause a trap 
and the access will be emulated). 
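
The VA-to-register mapping can be summarized in a few lines. The Python sketch below is illustrative: it assumes the register index is simply VA / 8 (which matches the "registers 0-3" and "registers 6-7" groupings in TABLE 9-1), and the names are hypothetical. It raises an error for any other VA because this section does not describe that case.

```python
FAST_VAS = (0x00, 0x08, 0x10, 0x18, 0x30, 0x38)   # handled directly in hardware
EMULATED_VAS = (0x20, 0x28)                        # trap, then software emulation


def scratchpad_access(va):
    """Return (register_index, path) for an ASI 0x20 access at this VA."""
    if va in FAST_VAS:
        return va >> 3, "hardware"
    if va in EMULATED_VAS:
        return va >> 3, "trap-and-emulate"
    raise ValueError("VA not described for ASI_SCRATCHPAD")
```

Privileged software that cares about latency would therefore keep hot values in registers 0-3 and 6-7 and reserve registers 4-5 for rarely used state.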


TABLE 9-2 defines the format of these registers. 
TABLE 9-2 Scratchpad - ASI_SCRATCHPAD (ASI 20₁₆; VA 0₁₆, 8₁₆, 10₁₆, 18₁₆, 30₁₆, or 38₁₆) 

Bit    Field        R/W   Description 
63:0   scratchpad   RW    Scratchpad. 

Warning | There is a known “feature” in UltraSPARC T1 that affects 
LDXA/STXA by privileged code to these ASI registers. If the 
immediately preceding instruction is a store that takes a trap, 
an LDXA can corrupt an unrelated IRF (integer register file) 
register, or a STXA may complete in spite of the trap. To prevent 
this, a non-store or NOP instruction is required before 
any LDXA/STXA to this ASI. If the LDXA/STXA is at a branch 
target, there must be a non-store in the delay slot. Nonprivileged 
software is not affected by this. 
CHAPTER 10 


Performance Instrumentation 








10.1 | Performance Control Register 


Each virtual processor has a privileged Performance Control register (PCR). 
Nonprivileged accesses to this register cause a privileged opcode trap. The 
performance control register contains six fields: ovfh, ovfl, sl, ut, st, and priv. 


■ ovfh and ovfl are state bits associated with the PIC.h and PIC.l overflow traps. They 
are provided in this register so that a process can be swapped out while it is in the 
state between the counter overflowing and the overflow trap being generated. 

■ sl controls which events are counted in PIC.l. 

■ ut controls whether user-level (nonprivileged) events are counted. 

■ st controls whether supervisor-level (privileged) events are counted. 

■ priv controls whether the PIC register can be read or written by nonprivileged 
software. 


The format of this register is shown in TABLE 10-1. Note that changing the fields in 
PCR does not affect the PIC values. To change the events monitored, software needs 
to disable counting via PCR, reset the PIC, and then enable the new event via the 
PCR. 


TABLE 10-1 Performance Control Register - PCR (ASR 10₁₆) 

Bit     Field   Initial Value   R/W   Description 
63:10   —       0               R     Reserved 
9       ovfh    0               RW    If 1, PIC.h has overflowed, and the next count event will 
                                      cause a disrupting trap to hyperprivileged software. The 
                                      trap will appear to be precise to the instruction following 
                                      the event. 
8       ovfl    0               RW    If 1, PIC.l has overflowed 
7       —       0               R     Reserved 
6:4     sl      0               RW    Selects one of eight events to be counted for PIC.l, per 
                                      TABLE 10-2. 
3       —       0               R     Reserved 
2       ut      0               RW    If ut = 1, count events in user mode; otherwise, ignore 
                                      user mode events. 
1       st      0               RW    If st = 1, count events in supervisor mode; otherwise, 
                                      ignore supervisor mode events. 
0       priv    0               RW    If priv = 1, prevent access to PIC by user-level code. If 
                                      priv = 0, allow access to PIC by user-level code. 
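
The field layout and the reprogramming procedure described above can be modeled as follows. This Python sketch is illustrative only: it takes ovfl to be bit 8 and ut to be bit 2 (the positions implied by the neighbouring rows), it interprets "disable counting" as clearing both ut and st, and all names are hypothetical. It produces the sequence of register values; it does not touch hardware.

```python
PCR_PRIV = 1 << 0
PCR_ST = 1 << 1
PCR_UT = 1 << 2
PCR_SL_SHIFT = 4
PCR_SL_MASK = 0x7 << PCR_SL_SHIFT
PCR_OVFL = 1 << 8
PCR_OVFH = 1 << 9


def make_pcr(sl, ut=False, st=False, priv=False):
    """Pack the software-programmable PCR fields into an integer."""
    if not 0 <= sl <= 7:
        raise ValueError("sl is a 3-bit field")
    value = sl << PCR_SL_SHIFT
    if ut:
        value |= PCR_UT
    if st:
        value |= PCR_ST
    if priv:
        value |= PCR_PRIV
    return value


def reprogram_sequence(current_pcr, new_sl):
    """Return the (register, value) writes for switching the PIC.l event."""
    if not 0 <= new_sl <= 7:
        raise ValueError("sl is a 3-bit field")
    counting = current_pcr & (PCR_UT | PCR_ST)       # remember who was counted
    stopped = current_pcr & ~(PCR_UT | PCR_ST)       # step 1: disable counting
    restarted = (stopped & ~PCR_SL_MASK) | (new_sl << PCR_SL_SHIFT) | counting
    return [("PCR", stopped), ("PIC", 0), ("PCR", restarted)]
```

For example, `reprogram_sequence(make_pcr(0b011, ut=True, st=True), 0b101)` yields a PCR write that stops counting, a PIC reset, and a final PCR write that selects the new event with the original ut and st settings.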





TABLE 10-2 contains the settings for the sl field. 


TABLE 10-2 sl Field Settings 

Event Name     Encoding   PIC   Description 
Instr_cnt      sl = XXX   H     Number of completed instructions. Annulled, mispredicted, or trapped 
                                instructions are not counted.¹ 
SB_full        sl = 000   L     Number of store buffer full cycles.² 
FP_instr_cnt   sl = 001   L     Number of completed floating-point instructions.³ Annulled or trapped 
                                instructions are not counted. 
IC_miss        sl = 010   L     Number of instruction cache (L1) misses. 
DC_miss        sl = 011   L     Number of data cache (L1) misses for loads (store misses are not included 
                                because the cache is write-through, non-allocating). 
ITLB_miss      sl = 100   L     Number of instruction TLB miss traps taken (includes real_translation 
                                misses). 
DTLB_miss      sl = 101   L     Number of data TLB miss traps taken (includes real_translation misses). 
L2_imiss       sl = 110   L     Number of secondary cache (L2) misses due to instruction cache requests. 
L2_dmiss_ld    sl = 111   L     Number of secondary cache (L2) misses due to data cache load requests.⁴ 





1. Tcc instructions that are cancelled due to encountering a higher-priority trap are still counted. 


2. SB_full increments every cycle a strand (virtual processor) is stalled due to a full store buffer, regardless of whether other strands are 
able to keep the processor busy. The overflow trap for SB_full is not precise to the instruction following the event that occurs when ovfl 
is set (the trap may occur on either the instruction following the event that occurs when ovfl is set, or on either of the next two 
instructions). 

3. Only floating-point instructions that execute in the shared FPU are counted. The following instructions are executed in the shared 
FPU: FADDS, FADDD, FSUBS, FSUBD, FMULS, FMULD, FDIVS, FDIVD, FSMULD, FSTOX, FDTOX, FXTOS, FXTOD, FITOS, FDTOS, 
FITOD, FSTOD, FSTOI, FDTOI, FCMPS, FCMPD, FCMPES, FCMPED. 


4. L2 misses due to stores cannot be counted by the performance instrumentation logic. 
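
For reading a PCR value back, the sl encodings can be kept in a lookup table. The Python sketch below is illustrative: it assumes the eight encodings run consecutively from 000 to 111 in the order the table lists the events and that sl occupies PCR bits 6:4. The dictionary and function names are hypothetical.

```python
SL_EVENTS = {
    0b000: "SB_full",
    0b001: "FP_instr_cnt",
    0b010: "IC_miss",
    0b011: "DC_miss",
    0b100: "ITLB_miss",
    0b101: "DTLB_miss",
    0b110: "L2_imiss",
    0b111: "L2_dmiss_ld",
}


def pic_l_event(pcr):
    """Name the event that PIC.l counts for a given PCR value."""
    return SL_EVENTS[(pcr >> 4) & 0x7]


def sl_for_event(name):
    """Inverse lookup: the sl encoding that selects the named event."""
    for encoding, event in SL_EVENTS.items():
        if event == name:
            return encoding
    raise KeyError(name)
```

Instr_cnt has no entry because PIC.h counts it regardless of sl.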
10.2 | SPARC Performance Instrumentation Counter 


Each strand (virtual processor) has a Performance Instrumentation Counter register 
(PIC). Access privilege to PIC is controlled by the setting of PCR.priv. When 
PCR.priv = 1, a nonprivileged access to this register causes a privileged_action trap. 
The PIC register contains two fields, h and l. The PIC.h field always counts the 
number of completed instructions. The PIC.l field counts the event selected by 
PCR.sl. 


The ut and st fields of PCR control whether events from user (nonprivileged) mode, 
supervisor (privileged) mode, both, or neither are counted. Whenever PCR.ovfh is 
set (which normally occurs when the PIC.h counter overflows, but may also result 
from a write to the PCR), a disrupting trap is generated on the next event that 
increments the counter. This trap will appear to be precise to the instruction 
following the one that caused the event. 


Programming | A WRasr to PCR that modifies the ovfh or ovfl bit behaves as if 
Note | the ovfh or ovfl bit was modified before the WRasr is executed. 
This implies that if all of the following conditions are true, a 
performance counter overflow trap will be taken (to 
hyperprivileged software) on the instruction following the 
WRasr: 
* a WRasr is executed in privileged mode that sets ovfh or ovfl to 1 
* PCR.st = 1 
* the WRasr generates the event being counted 





The format of the PIC register is shown in TABLE 10-3. 


TABLE 10-3 Performance Instrumentation Counter Register - PIC (ASR 11₁₆) 

Bit     Field   Initial Value   R/W   Description 
63:32   h       0               RW    Instruction counter. 
31:0    l       0               RW    Programmable event counter, event controlled by PCR.sl. 
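
Because both counters share one 64-bit register, software that samples PIC has to split the value and allow for 32-bit wraparound. The following Python sketch shows one way to do that; the function names are hypothetical, and the delta calculation assumes at most one wrap between two samples.

```python
MASK32 = 0xFFFFFFFF


def split_pic(pic):
    """Return (h, l): the instruction count and the programmable event count."""
    return (pic >> 32) & MASK32, pic & MASK32


def counter_delta(before, after):
    """Events counted between two samples of one 32-bit PIC field."""
    return (after - before) & MASK32
```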
CHAPTER 11 


Memory Management 


11.1 | Translation Table Entry (TTE) 


The Translation Table Entry (TTE) holds information for a single page mapping. The 
TTE is broken into two 64-bit words, representing the tag and data of the translation. 
Just as in a hardware cache, the tag is used to determine whether there is a hit in the 
TSB. If there is a hit, the data is fetched by software. 


11.1.1 TTE Tag Format 


UltraSPARC T1 supports both the UltraSPARC Architecture 2005 TTE tag format (as 
described in the UltraSPARC Architecture 2005 specification; also known as the 
"sun4v" TTE format) and the older sun4u TTE tag format. 


Note that UltraSPARC T1 supports only 13-bit context IDs; therefore, the most 
significant 3 bits of the (16-bit) context field are always zero. 


UltraSPARC T1 supports 48-bit virtual addresses in hardware. When hardware 
writes a 48-bit virtual address into a 64-bit register, it sign-extends (copies) the most 
significant address bit (bit 47) into bits 63:48 of the register. 
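
The sign extension can be expressed in a few lines. This Python sketch is only a model of the rule stated above; the function names are hypothetical.

```python
VA_MASK48 = (1 << 48) - 1
MASK64 = (1 << 64) - 1


def sign_extend_va48(va):
    """Copy bit 47 of a 48-bit virtual address into bits 63:48."""
    va &= VA_MASK48
    if va & (1 << 47):
        va |= 0xFFFF << 48
    return va


def is_sign_extended(va64):
    """True if a 64-bit value is what hardware would write for some 48-bit VA."""
    va64 &= MASK64
    return sign_extend_va48(va64) == va64
```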


11.1.2 TTE Data Format 


For the data portion of the TTE, both the sun4v and sun4u formats are supported by 
UltraSPARC T1. The sun4v TTE data format is described in the UltraSPARC 
Architecture 2005 specification. 


UltraSPARC Architecture 2005 specifies a 4-bit size field for TTE entries. Since 
UltraSPARC T1 only supports a 3-bit size field, the most significant bit of TTE (bit 3) 
is ignored when written. 
In the sun4u TTE virtual address tag, bits 63:22 are used. Bits 21 through 13 are not 
maintained in the tag, since these bits are used to index the smallest direct-mapped TSB of 
512 entries. 


The sun4u TTE data format is shown in TABLE 11-1. 


TABLE 11-1 Format 16 Sun4u TTE Data Format 

Bit     Field   Description 
63      v       Valid 
62:61   szl     size{1:0} 
60      nfo     No-fault-only 
59      ie      Invert endianness 
58:49   soft2   Soft2 
48      szh     size{2} 
47:40   diag    Diagnostic 
39:13   pa      PA{39:13} 
12:8    soft    Soft 
7       —       Reserved 
6       l       Locked 
5       cp      Cacheable in physically indexed cache 
4       cv      Cacheable in virtually indexed cache 
3       e       Side effect 
2       p       Privileged 
1       w       Writable 
0       —       Reserved 
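
To make the bit layout concrete, the following Python sketch unpacks a 64-bit sun4u TTE data word using the positions in TABLE 11-1. The 3-bit size value is assembled from szh (bit 2) and szl (bits 1:0). The function names and the dictionary keys are hypothetical.

```python
def field(word, hi, lo):
    """Extract bits hi:lo (inclusive) from an integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_sun4u_tte_data(tte):
    """Unpack a sun4u TTE data word into named fields."""
    return {
        "v": field(tte, 63, 63),
        "size": (field(tte, 48, 48) << 2) | field(tte, 62, 61),
        "nfo": field(tte, 60, 60),
        "ie": field(tte, 59, 59),
        "soft2": field(tte, 58, 49),
        "diag": field(tte, 47, 40),
        "pa": field(tte, 39, 13) << 13,   # physical address of the page base
        "soft": field(tte, 12, 8),
        "l": field(tte, 6, 6),
        "cp": field(tte, 5, 5),
        "cv": field(tte, 4, 4),
        "e": field(tte, 3, 3),
        "p": field(tte, 2, 2),
        "w": field(tte, 1, 1),
    }
```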





TABLE 11-2 provides UltraSPARC T1-specific information regarding sun4u TTE data 
fields. 


TABLE 11-2 TTE Field Description 

Field         Description 
nfo           No-Fault-Only. 
ie            Invert Endianness. 
soft, soft2   Software-defined fields, provided for use by the operating system. Software fields are 
              not implemented in UltraSPARC T1 hardware. 
diag          Used by diagnostics. 
pa            The physical page number. 
l             Lock. If this bit is set, the TTE entry will be “locked down”. 
w             Writable. 
11.2 | Translation Storage Buffer 


A Translation Storage Buffer (TSB) is an array of TTEs managed entirely by software. 
It serves as a cache of the Software Translation Table. The discussion in this section 
assumes the use of hardware support for TSB access, although the operating system 
is not required to make use of this support hardware. 


Inclusion of the TLB entries in a TSB is not required; that is, translation information 
may exist in the TLB that is not present in the TSB. 


A TSB is arranged as a direct-mapped cache of TTEs. The n least significant bits of a 
virtual page number are used as the offset from the respective TSB base address, 
where n equals log2 of the number of TTEs in the TSB. 
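
The indexing rule can be illustrated with a short calculation. This Python sketch assumes 16-byte TSB entries (an 8-byte tag followed by 8 bytes of data) and takes the page size as a shift count, for example 13 for 8-Kbyte pages. The function name and parameters are hypothetical, and the hardware-generated pointers may differ in detail.

```python
TTE_BYTES = 16   # 8-byte tag + 8-byte data


def tsb_entry_address(tsb_base, va, page_shift, n_entries):
    """Address of the direct-mapped TSB entry that would hold the TTE for va."""
    if n_entries <= 0 or n_entries & (n_entries - 1):
        raise ValueError("number of TTEs must be a power of two")
    vpn = va >> page_shift              # virtual page number
    index = vpn & (n_entries - 1)       # low log2(n_entries) bits of the VPN
    return tsb_base + index * TTE_BYTES
```

With 8-Kbyte pages and a 512-entry TSB, the index is VA bits 21:13, which is why those bits need not be kept in the sun4u tag.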


A bit in the TSB register allows the PS0 and PS1 pointers to be computed for the case 
of separate or split PS0/PS1 TSB(s). 


No hardware TSB indexing support is provided for TTEs of pages other than PS0 
and PS1. Since the TSB is entirely software managed, however, the operating system 
may choose to place these different page TTEs in the TSB by forming the appropriate 
pointers. In addition, simple modifications to the PS0 and PS1 index pointers 
provided by the hardware allow formation of an M-way set-associative TSB, 
multiple TSBs per page size, and multiple TSBs per process. 


The TSB exists as a normal data structure in memory, and therefore may be cached. 
Indeed, the speed of the TLB miss handler relies on the TSB accesses hitting the 
level-2 cache at a substantial rate. This policy may result in some conflicts with 
normal instruction and data accesses, but the dynamic sharing of the level-2 cache 
resource should provide a better overall solution than that provided by a fixed 
partitioning. 


FIGURE 11-1 shows both the common and split TSB organization. The constant n is 
determined by the size field in the TSB register; it may range from 512 entries to 
16 M entries. 


[Figure: each TSB line holds an 8-byte tag followed by 8 bytes of data, from 
Tag1/Data1 at offset 0000₁₆ through Tagn/Datan. A common TSB contains n lines; 
a split TSB contains 2n lines.] 


FIGURE 11-1 TSB Organization 
11.3 | MMU-Related Faults and Traps 


MMU traps are described in TABLE 11-3. 


TABLE 11-3 MMU Trap Description 

Trap                           Description 

data_access_exception          Occurs when one of the following events (the D-MMU does not prioritize 
                               these and may set multiple bits) occurs: 
                               * The D-MMU detects a privilege violation for a data access; that is, an 
                                 attempted access to a privileged page when PSTATE.priv = 0. 
                               * A speculative (nonfaulting) load instruction issued to a page marked 
                                 with the side-effect (e) bit = 1. 
                               * An atomic instruction issued to an I/O address (that is, VA{39} = 1). 
                               * An invalid LDA/STA ASI value, invalid virtual address, read to write- 
                                 only register, or write to read-only register, but not for an attempted 
                                 user access to a restricted ASI (see the privileged_action trap described 
                                 below). 
                               * An access with an ASI other than 
                                 ASI_{PRIMARY,SECONDARY}_NO_FAULT[_LITTLE] to a page marked 
                                 with the nfo (no-fault-only) bit. 
                               * Virtual address out of range and PSTATE.am is not set. See 48-bit 
                                 Virtual Address Space on page 39 for details. 

instruction_access_exception   Occurs when the I-MMU is enabled and one of the following happens: 
                               * The I-MMU detects a privilege violation for an instruction fetch; that is, 
                                 an attempted access to a privileged page when PSTATE.priv = 0. 
                               * Virtual address out of range and PSTATE.am is not set. See 48-bit 
                                 Virtual Address Space on page 39. Note that the cases of JMPL/RETURN 
                                 and branch-CALL-sequential are handled differently. 

mem_address_not_aligned        Occurs when a load, store, atomic, or JMPL/RETURN instruction with a 
                               misaligned address is executed. 

privileged_action              Occurs when an access is attempted using a restricted ASI while in 
                               nonprivileged mode (PSTATE.priv = 0). 

VA_watchpoint                  Occurs when virtual watchpoints are enabled and the D-MMU detects a 
                               load or store to the virtual address specified by the VA Data Watchpoint 
                               register. 


58 UltraSPARC T1 Supplement * Draft D2.0, 17 Mar 2006 


CHAPTER 12 


Implementation Dependencies 








12.1 SPARC V9 General Information 


12.1.1 Level-2 Compliance (Impl. Dep. #1) 


UltraSPARC T1 is designed to meet Level-2 SPARC V9 compliance. It does the 
following: 


■ Correctly interprets all nonprivileged operations, and 


■ Correctly interprets all privileged elements of the architecture. 


Note | System emulation routines (for example, for quad-precision 
floating-point operations) shipped with UltraSPARC T1 also 
must be Level-2 compliant. 


12.1.2 Unimplemented Opcodes, ASIs, and ILLTRAP 


SPARC V9 unimplemented instructions, reserved instructions, ILLTRAP opcodes, and 
instructions with invalid values in reserved fields (other than reserved FPops and the 
reserved field in the Tcc instruction) encountered during execution cause an 
illegal_instruction trap. Reserved FPops cause an fp_exception_other (with 
FSR.ftt = unimplemented_FPop) trap. Unimplemented and reserved ASI values cause 
a data_access_exception trap. 


12.1.3 Trap Levels (Impl. Dep. #37, 38, 39, 40, 101, 114, 115) 


UltraSPARC T1 supports two trap levels; that is, MAXPTL = 2. Normal execution is at 
TL = 0. 

A strand normally executes at trap level 0 (execute state, TL = 0). In SPARC V9, 
a trap makes the CPU enter the next higher trap level, which is a fast and efficient 
process because there is one set of trap state registers for each trap level. After 
saving the most important machine states (PC, NPC, PSTATE) on the trap stack at 
this level, the trap (or error) condition is processed. 


12.1.4 Trap Handling (Impl. Dep. #16, 32, 33, 35, 36, 44) 


UltraSPARC T1 supports precise trap handling for all operations except for 
disrupting traps from hardware failures and interrupts. UltraSPARC T1 implements 
precise traps, interrupts, and exceptions for all instructions, including long latency 
floating-point operations. 


UltraSPARC T1 can efficiently execute kernel code even in the event of multiple 
nested traps, promoting processor efficiency while dramatically reducing the system 
overhead needed for trap handling. Three sets of global registers are 
provided (MAXPGL = 2), for use at TL = 0, TL = 1, and TL = 2. 


This further increases OS performance, providing fast trap execution by avoiding the 
need to save and restore registers while processing exceptions. 


All traps supported in UltraSPARC T1 are listed in the “Traps” chapter of this 
document. 


12.1.5 Population Count Instruction (POPC) 


The population count instruction, POPC, generates an illegal_instruction exception 
and is emulated in software rather than being executed in hardware. 


12.1.6 Secure Software 


To establish an enhanced security environment, it may be necessary to initialize 
certain strand states between contexts. Examples of such states are the contents of 
integer and floating-point register files, condition codes, and state registers. See also 
Clean Window Handling (Impl. Dep. #102). 


12.1.7 Address Masking (Impl. Dep. #125) 


When PSTATE.am = 1, the CALL, JMPL, and RDPC instructions and all traps 
transmit zero in the high-order 32 bits of the PC to their specified destination 
registers. Traps also transmit zero in the high-order 32 bits of the NPC to the TNPC. 
Branch target addresses sent to the NPC and the updating of NPC with NPC+4 for a 
non-control-transferring instruction do not zero the high-order 32 bits. Restoration 
of PC and NPC from TPC and TNPC on a DONE or RETRY instruction does not mask 
the high-order 32 bits. 


Note | When PSTATE.am = 1, address masking applies to all VAs, even 
those that immediately do a VA-to-RA bypass. This implies that 
with PSTATE.am = 1, RA{63:32} will be zeros after a VA-to-RA 
bypass. 
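
The masking rules in this section can be modeled in a few lines. The Python sketch below is illustrative only; the function names are hypothetical, and it covers just the two cases stated above: the PC value delivered to a destination register, and the real address produced by a VA-to-RA bypass.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def pc_written_to_register(pc, pstate_am):
    """PC value that CALL, JMPL, RDPC, or a trap delivers to its destination."""
    return pc & MASK32 if pstate_am else pc & MASK64


def bypass_real_address(va, pstate_am):
    """RA produced by a VA-to-RA bypass; RA{63:32} is zero when am = 1."""
    return va & MASK32 if pstate_am else va & MASK64
```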
12.2 | SPARC V9 Integer Operations 


12.2.1 Integer Register File and Window Control 
Registers (Impl. Dep. #2) 


UltraSPARC T1 implements an eight-window 64-bit integer register file; that is, 
N_REG_WINDOWS = 8. UltraSPARC T1 truncates values stored in the CWP, 
CANSAVE, CANRESTORE, CLEANWIN, and OTHERWIN registers to three bits. This 
includes implicit updates to these registers by SAVE(D) and RESTORE(D) 
instructions. The most-significant two bits of these registers read as zero. 
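
As a small model of the truncation rule, the Python sketch below keeps only the low three bits of any value written to a window control register. It also shows CWP wrapping after a SAVE, which increments CWP in SPARC V9. The function names are hypothetical.

```python
N_REG_WINDOWS = 8
WINDOW_REG_MASK = N_REG_WINDOWS - 1   # three implemented bits


def write_window_reg(value):
    """Value a later read returns after writing CWP, CANSAVE, and so on."""
    return value & WINDOW_REG_MASK


def cwp_after_save(cwp):
    """CWP after a SAVE: incremented, then truncated to three bits."""
    return write_window_reg(cwp + 1)
```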


12.2.2 SAVE Instruction 


Upon a SAVE instruction, UltraSPARC T1 initializes the values of the local registers 
in the new window to the same values as the local registers in the old window and 
initializes the values of the out registers in the new window to the same values as the 
in registers in the old window (that is, the new window matches the old window 
with the ins and outs swapped). Since this implies that they contain values from the 
executing process, V9 compliance is maintained. In this sense, the behavior of the 
SAVE instruction on UltraSPARC T1 differs from most other SPARC V9 
implementations.¹ 


1. Most SPARC V9 processors do not initialize the local and out registers on a SAVE instruction; instead, the values 
in the local and out registers are those left there from the last time the window was used. 


12.2.3 Clean Window Handling (Impl. Dep. #102) 


SPARC V9 introduced the concept of a "clean window" to enhance security and 
integrity during program execution. A clean window is defined to be a register 
window that contains either all zeroes or addresses and data that belong to the 
current context. The CLEANWIN register records the number of available clean 
windows. 


When a SAVE instruction requests a window and there are no more clean windows, 
a clean window trap is generated. Note that the behavior on a clean window trap for 
UltraSPARC T1 is the same as for a SAVE instruction, namely, the local registers for 
the new window remain the same as the local registers from the old window, while 
the out registers in the new window contain the contents of the in registers from the 
old window. Thus, while UltraSPARC T1 generates a clean window trap, the new 
window is automatically cleaned by hardware. System software only needs to 
increment CLEANWIN before returning to the requesting context. 


12.2.4 Integer Multiply and Divide 


Integer multiplications (MULScc, SMUL{cc}, MULX) and divisions (SDIV{cc}, 
UDIV{cc}, UDIVX) are executed directly in hardware. 


12.2.5 MULScc 


SPARC V9 does not define the value of xcc and R[rd]{63:32} for MULScc. 
UltraSPARC T1 sets xcc and rd based on the results of adding either (32 copies of 
R[rs1]{63} :: CCR.icc.n xor CCR.icc.v, R[rs1]{31:1}) or 0 (depending on Y{0}) to either 
R[rs2]{63:0} or the immediate operand. 





12.3 | SPARC V9 Floating-Point Operations 


12.3.1 Subnormal Operands and Results: Nonstandard 
Operation 


UltraSPARC T1 handles all cases of subnormal operands or results directly in 
hardware. 
Because there is no trapping on subnormal operands, UltraSPARC T1 does not 
support the nonstandard result option of the SPARC V9 architecture, and the FSR.ns 
bit ignores any value written to it and always returns zero on a read. 


12.3.2 Overflow, Underflow, and Inexact Traps (Impl. 
Dep. #3, 55) 

UltraSPARC T1 implements precise floating-point exception handling. Underflow is 
detected before rounding. 


Note | Major performance degradation may be observed while running 
with the inexact exception enabled. 


12.3.3 Quad-Precision Floating-Point Operations (Impl. 
Dep. #3) 


All quad-precision floating-point instructions, listed in TABLE 12-1, cause an 
fp_exception_other (with FSR.ftt = 3, unimplemented_FPop) trap. These operations are 
emulated in system software. 


TABLE 12-1 Unimplemented Quad-Precision Floating-Point Instructions 

Instruction   Description 
F{s,d}TOq     Convert single-/double- to quad-precision floating-point 
F{i,x}TOq     Convert 32-/64-bit integer to quad-precision floating-point 
FqTO{s,d}     Convert quad- to single-/double-precision floating-point 
FqTO{i,x}     Convert quad-precision floating-point to 32-/64-bit integer 
FCMP{E}q      Quad-precision floating-point compares 
FMOVq         Quad-precision floating-point move 
FMOVqcc       Quad-precision floating-point move, if condition is satisfied 
FMOVqr        Quad-precision floating-point move if register match condition 
FABSq         Quad-precision floating-point absolute value 
FADDq         Quad-precision floating-point addition 
FDIVq         Quad-precision floating-point division 
FdMULq        Double- to quad-precision floating-point multiply 
FMULq         Quad-precision floating-point multiply 
FNEGq         Quad-precision floating-point negation 
FSQRTq        Quad-precision floating-point square root 
FSUBq         Quad-precision floating-point subtraction 


12.3.4 Floating-Point Square Root 


The three floating-point square root instructions (FSQRTS, FSQRTD, and FSQRTQ) are 
unimplemented. Execution of any of these instructions results in an 
fp_exception_other exception, with FSR.ftt = unimplemented_FPop. 


12.3.5 Floating-Point Upper and Lower Dirty Bits in 
FPRS Register 


The FPRS_dirty_upper (du) and FPRS_dirty_lower (dl) bits in the Floating-Point 
Registers State (FPRS) register are set when an instruction that modifies the 
corresponding upper and lower half of the floating-point register file is issued. 
Floating-point register file modifying instructions include floating-point operate, 
graphics, floating-point loads, and block load instructions. 


The FPRS.du and FPRS.dl bits may be set pessimistically, even though the instruction
that modified the floating-point register file is nullified due to a trap. This includes
the case where the floating-point instruction itself takes an fp_disabled trap.


12.3.6 Floating-Point State Register (FSR) (Impl. Dep. #13, 19, 22, 23, 24)


UltraSPARC T1 supports precise traps and implements all three exception fields
(tem, cexc, and aexc) conforming to IEEE Standard 754-1985. TABLE 12-2 defines the
register bits.











TABLE 12-2 Floating-Point Status Register Format

Bits   Field  RW  Description

63:38  —      R   Reserved
37:36  fcc3   RW  Floating-point condition code (set 3). One of four sets of 2-bit
                  floating-point condition codes, which are modified by the FCMP{E}
                  (and LD{X}FSR) instructions. The FBfcc, FMOVcc, and MOVcc
                  instructions use one of these condition code sets to determine
                  conditional control transfers and conditional register moves.
35:34  fcc2   RW  Floating-point condition code (set 2). See fcc3.
33:32  fcc1   RW  Floating-point condition code (set 1). See fcc3.
31:30  rd     RW  IEEE Std. 754-1985 rounding direction. Rounding modes are shown
                  in TABLE 12-3.
29:28  —      R   Reserved
27:23  tem    RW  5-bit trap enable mask for the IEEE-754 floating-point exceptions.
                  If a floating-point operate instruction produces one or more
                  exceptions, the corresponding cexc/aexc bits are set and an
                  fp_exception_ieee_754 (with FSR.ftt = 1, IEEE_754_exception)
                  exception is generated.
22     ns     R   Nonstandard floating-point results. Always 0: UltraSPARC T1
                  produces IEEE-754 compatible results.
21:20  —      R   Reserved
19:17  ver    R   FPU version number. Identifies a particular implementation of the
                  UltraSPARC T1 FPU architecture.
16:14  ftt    R   Floating-point trap type. The 3-bit floating-point trap type field
                  is set whenever a floating-point instruction causes the
                  fp_exception_ieee_754 or fp_exception_other traps. Trap types are
                  listed in TABLE 12-4, below.
13     qne    R   Floating-point deferred-trap queue (FQ) not empty. Not used,
                  because UltraSPARC T1 implements precise floating-point exceptions.
12     —      R   Reserved
11:10  fcc0   RW  Floating-point condition code (set 0). See fcc3.
                  Note: fcc0 is the same as fcc in SPARC V8.
9:5    aexc   RW  5-bit accrued exception field. Accumulates IEEE 754 exceptions
                  while floating-point exception traps are disabled (that is,
                  FSR.tem = 0).
4:0    cexc   RW  5-bit current exception field. Indicates the most recently
                  generated IEEE 754 exceptions.


TABLE 12-3 Floating-Point Rounding Modes


rd   Round Toward

0    Nearest (even if tie)
1    0
2    +∞
3    −∞
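

As a reading aid, the FSR layout of TABLE 12-2 can be turned into a small field decoder. The following is a minimal sketch, not vendor code: the bit positions come from TABLE 12-2, the ftt names follow the SPARC V9 encoding that TABLE 12-4 lists, and the helper name decode_fsr and the sample value are illustrative assumptions.

```python
# Bit ranges (high, low) copied from TABLE 12-2. Helper names are hypothetical.
FSR_FIELDS = {
    "fcc3": (37, 36), "fcc2": (35, 34), "fcc1": (33, 32),
    "rd": (31, 30), "tem": (27, 23), "ns": (22, 22),
    "ver": (19, 17), "ftt": (16, 14), "qne": (13, 13),
    "fcc0": (11, 10), "aexc": (9, 5), "cexc": (4, 0),
}

# ftt encodings 0..7 (SPARC V9 order, as listed in TABLE 12-4).
FTT_NAMES = [
    "none", "IEEE_754_exception", "unfinished_FPop", "unimplemented_FPop",
    "sequence_error", "hardware_error", "invalid_fp_register", "reserved",
]

# rd encodings 0..3 from TABLE 12-3.
ROUNDING = ["nearest", "toward 0", "toward +inf", "toward -inf"]


def decode_fsr(value):
    """Split a 64-bit FSR image into its named fields."""
    fields = {}
    for name, (hi, lo) in FSR_FIELDS.items():
        width = hi - lo + 1
        fields[name] = (value >> lo) & ((1 << width) - 1)
    return fields


# Made-up sample: ftt = 3, rd = 1, aexc = 0b00101.
sample = (3 << 14) | (1 << 30) | (0b00101 << 5)
fields = decode_fsr(sample)
ftt_name = FTT_NAMES[fields["ftt"]]
rounding = ROUNDING[fields["rd"]]
```

For the sample value, ftt decodes to unimplemented_FPop, which is the trap type this chapter describes for quad-precision and square root instructions.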


TABLE 12-4 Floating-Point Trap Type Values


ftt   Floating-Point Trap Type   Trap Signaled

0     None                       —
1     IEEE_754_exception         fp_exception_ieee_754
2     unfinished_FPop            —
3     unimplemented_FPop         fp_exception_other
4     sequence_error             —
5     hardware_error             —
6     invalid_fp_register        —
7     reserved                   —


Notes | (1) UltraSPARC T1 neither detects nor generates the
unfinished_FPop, sequence_error, hardware_error, or
invalid_fp_register trap types directly in hardware.


(2) UltraSPARC T1 does not contain an FQ. An attempt to read
the FQ with a RDPR instruction causes an illegal_instruction trap.


12.4 SPARC V9 Memory-Related Operations


12.4.1 Load/Store Alternate Address Space (Impl. Dep. #5, 29, 30)


Supported ASI accesses are listed in Alternate Address Spaces on page 41.


12.4.2 Read/Write ASR (Impl. Dep. #6, 7, 8, 9, 47, 48)


Supported ASRs are discussed in Ancillary State Registers (ASRs) on page 7.


12.4.3 FLUSH and Self-Modifying Code (Impl. Dep. #122)


FLUSH is needed to synchronize code and data spaces after code space is modified
during program execution. FLUSH is described in Supported Memory Models on page
36. On UltraSPARC T1, the FLUSH effective address is ignored, and as a result,
FLUSH cannot cause a data_access_exception trap.


Note | SPARC V9 specifies that the FLUSH instruction has no latency
on the issuing strand. In other words, a store to instruction
space prior to the FLUSH instruction is visible immediately after
the completion of FLUSH. MEMBAR #StoreStore is required
to ensure proper ordering in a multiprocessing system when the
memory model is not TSO. When a MEMBAR #StoreStore,
FLUSH sequence is performed, UltraSPARC T1 guarantees that
earlier code modifications will be visible across the whole
system.


12.4.4 PREFETCH{A} (Impl. Dep. #103, 117)


For UltraSPARC T1, PREFETCH{A} instructions follow TABLE 12-5 based on the fcn
value. All prefetches in UltraSPARC T1 are weak (on an MMU miss or when the
MMU is bypassed, the prefetch is dropped). The only trap that a prefetch can
generate on UltraSPARC T1 is illegal_instruction (for fcn = 5₁₆–F₁₆).


TABLE 12-5 PREFETCH{A} Variants


fcn         Prefetch Function                            Action

0₁₆         Weak prefetch for several reads              Weak prefetch into Level 2 cache
1₁₆         Weak prefetch for one read                   Weak prefetch into Level 2 cache
2₁₆         Weak prefetch for several writes             Weak prefetch into Level 2 cache
3₁₆         Weak prefetch for one write                  Weak prefetch into Level 2 cache
4₁₆         Prefetch Page                                No operation
5₁₆–F₁₆     —                                            illegal_instruction trap
10₁₆        Invalidate read-once prefetch                Weak prefetch into Level 2 cache
11₁₆        Prefetch for read to nearest unified cache   Weak prefetch into Level 2 cache
12₁₆–13₁₆   Strong prefetches                            Weak prefetch into Level 2 cache
14₁₆        Strong prefetch for several reads            Weak prefetch into Level 2 cache
15₁₆        Strong prefetch for one read                 Weak prefetch into Level 2 cache
16₁₆        Strong prefetch for several writes           Weak prefetch into Level 2 cache
17₁₆        Strong prefetch for one write                Weak prefetch into Level 2 cache
18₁₆–1F₁₆   —                                            No operation
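

The fcn dispatch in TABLE 12-5 reduces to a few range checks. The sketch below is illustrative only: the function name is hypothetical, rows whose Action cell is blank in the table are assumed to share the action of the row above them, and the last row is assumed to cover fcn 18₁₆ through 1F₁₆.

```python
def prefetch_action(fcn):
    """Return the UltraSPARC T1 action for a 5-bit PREFETCH{A} fcn value.

    Assumption: blank Action cells in TABLE 12-5 inherit the action of the
    preceding row, so every defined variant is a weak prefetch into L2.
    """
    if not 0 <= fcn <= 0x1F:
        raise ValueError("fcn is a 5-bit field")
    if fcn <= 0x03:
        return "weak prefetch into L2"      # weak read/write variants
    if fcn == 0x04:
        return "no operation"               # Prefetch Page
    if fcn <= 0x0F:
        return "illegal_instruction trap"   # the only trap a prefetch can take
    if fcn <= 0x17:
        return "weak prefetch into L2"      # "strong" variants are treated as weak
    return "no operation"                   # assumed range 0x18..0x1F
```

Note that the strong variants map to the same action as the weak ones, which matches the statement that all prefetches on UltraSPARC T1 are weak.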


PREFETCHA is legal for all implemented ASIs in UltraSPARC T1 and will prefetch
into the Level 2 cache from memory using the context listed in TABLE 12-6.
Prefetching is done regardless of privilege level (for example, user mode can use ASI
10₁₆ to prefetch into the L2 cache).
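

The ASI-to-context mapping given in TABLE 12-6 amounts to a set lookup. This is a hedged sketch: the ASI values are transcribed from that table, while the function name and return strings are assumptions made for illustration.

```python
# ASI values (hexadecimal) transcribed from TABLE 12-6.
PRIMARY = set(bytes.fromhex(
    "10 16 18 1E 22 2A 80 82 88 8A C0 C2 C4 C8 CA CC D0 D2 D8 DA E0 E2 EA F0 F8"))
SECONDARY = set(bytes.fromhex(
    "11 17 19 1F 23 2B 81 83 89 8B C1 C3 C5 C9 CB CD D1 D3 D9 DB E1 E3 EB F1 F9"))
NUCLEUS = set(bytes.fromhex(
    "04 0C 14 15 1C 1D 20 21 24 25 26 27 2C 2E 2F 31 32 33 35 36 37 39 3A 3B "
    "3D 3E 3F 40 42 43 44 45 46 47 4B 4C 4D 4F 50 51 52 54 55 56 57 58 59 5A "
    "5B 5C 5D 5E 5F 60 66 67 72 73 74"))


def prefetcha_context(asi):
    """Return the context PREFETCHA uses for an ASI, or None if not listed."""
    if asi in PRIMARY:
        return "primary"
    if asi in SECONDARY:
        return "secondary"
    if asi in NUCLEUS:
        return "nucleus"
    return None
```

For example, prefetcha_context(0x10) returns "primary", which corresponds to the user-mode case mentioned above.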


TABLE 12-6 PREFETCH{A} ASIs 





Context ASIs (hexadecimal) 


Primary 10, 16, 18, 1E, 22, 2A, 80, 82, 88, 8A, C0, C2, C4, C8, CA, CC, D0, D2,
D8, DA, E0, E2, EA, F0, F8


Secondary 11, 17, 19, 1F, 23, 2B, 81, 83, 89, 8B, C1, C3, C5, C9, CB, CD, D1, D3,
D9, DB, E1, E3, EB, F1, F9


Nucleus 04, 0C, 14, 15, 1C, 1D, 20, 21, 24, 25, 26, 27, 2C, 2E, 2F, 31, 32, 33, 35,
36, 37, 39, 3A, 3B, 3D, 3E, 3F, 40, 42, 43, 44, 45, 46, 47, 4B, 4C, 4D, 4F, 
50, 51, 52, 54, 55, 56, 57, 58, 59, 5A, 5B, 5C, 5D, 5E, 5F, 60, 66, 67, 72, 
73, 74


Implementation | Although it would have been desirable to treat PREFETCHA to
Note | restricted ASIs by underprivileged code as NOPs, PREFETCH
only moves data between main memory and the L2 cache, so
UltraSPARC T1's implementation causes no security issues.


12.4.5 Instruction Prefetch


UltraSPARC T1 does not implement an instruction prefetch. No prefetching is
performed from the effective address of the BPN instruction.


12.4.6 LDTW/STTW Handling (Impl. Dep. #107, 108)


LDTW and STTW instructions are directly executed in hardware.


Note | LDTW/STTW were deprecated in SPARC V9. In UltraSPARC
T1, it is more efficient to use LDX/STX for accessing 64-bit data.
LDTW/STTW take longer to execute than two 32-/64-bit loads/
stores.


12.4.7 Floating-Point mem_address_not_aligned (Impl. Dep. #109, 110, 111, 112)


LDDF{A}/STDF{A} cause an LDDF/STDF_mem_address_not_aligned trap if the
effective address is 32-bit aligned but not 64-bit (doubleword) aligned.


LDQF{A}/STQF{A} are not directly executed in hardware; they cause an
illegal_instruction trap.


12.4.8 Supported Memory Models (Impl. Dep. #113, 121)


UltraSPARC T1 supports only the TSO memory model, although certain specific
operations such as block loads and stores operate under the RMO memory model.
See Supported Memory Models on page 36.


12.4.9 Implicit ASI when TL > 0 (Impl. Dep. #124)


UltraSPARC T1 matches all UltraSPARC Architecture implementations, and makes
the implicit ASI for instruction fetching ASI_NUCLEUS when TL > 0, while the
implicit ASI for loads and stores when TL > 0 is ASI_NUCLEUS if PSTATE.cle = 0 or
ASI_NUCLEUS_LITTLE if PSTATE.cle = 1.




















Compatibility | With an implicit ASI for instruction fetching of ASI_NUCLEUS, if
Note | software were to set the strand in a state where PSTATE.priv = 0
but TL > 0, an instruction fetch would generate an
instruction_access_exception, because user-level code is
accessing ASI_NUCLEUS. UltraSPARC I/II overrides this
instruction_access_exception and allows instruction fetching
when PSTATE.priv = 0 and TL > 0. UltraSPARC T1 is compatible
with UltraSPARC I/II and does the same override of
instruction_access_exception when PSTATE.priv = 0 and TL > 0.
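

The implicit-ASI selection and the compatibility override described above can be summarized in a few lines. This is a minimal sketch with hypothetical function names; it models only the TL > 0 cases covered by this section and says nothing about TL = 0.

```python
def implicit_asi(tl, cle, access):
    """Implicit ASI when TL > 0; access is "fetch" or "data"."""
    if tl <= 0:
        raise ValueError("this sketch only covers TL > 0")
    if access == "fetch":
        return "ASI_NUCLEUS"
    if access == "data":
        # Loads and stores follow PSTATE.cle for endianness.
        return "ASI_NUCLEUS_LITTLE" if cle else "ASI_NUCLEUS"
    raise ValueError("access must be 'fetch' or 'data'")


def fetch_exception_overridden(tl, priv):
    """True when the instruction_access_exception override applies
    (PSTATE.priv = 0 and TL > 0), so instruction fetching is allowed."""
    return tl > 0 and priv == 0
```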














12.5 Non-SPARC V9 Extensions


12.5.1 Cache Subsystem


UltraSPARC T1 contains one or more levels of cache. The cache subsystem
architecture is described in Appendix F, Caches and Cache Coherency.


12.5.2 Block Memory Operations


UltraSPARC T1 supports 64-byte block memory operations utilizing a block of eight
double-precision floating-point registers as a temporary buffer. See Block Load and
Store Instructions on page 18.


12.5.3 Partial Stores


UltraSPARC T1 does not support 8-/16-/32-bit partial stores to memory.


12.5.4 Short Floating-Point Loads and Stores


UltraSPARC T1 does not support 8-/16-bit loads and stores to the floating-point
registers.


12.5.5 Interrupt Vector Handling


CPUs and I/O devices can interrupt a selected CPU by assembling and sending an
interrupt packet. This allows hardware interrupts and cross calls to have the same
hardware mechanism and to share a common software interface for processing.
Interrupt vectors are described in Chapter 7, Interrupt Handling.


12.5.6 Power-Down Support


UltraSPARC T1 supports the ability to power down virtual processors and I/O
devices to reduce power requirements during idle periods.


12.5.7 UltraSPARC T1 Instruction Set Extensions (Impl. Dep. #106)


The UltraSPARC T1 CPU supports a subset of the VIS 1.0 and 2.0 instructions; see
UltraSPARC Architecture 2005 Instructions Not Directly Implemented by UltraSPARC T1
Hardware on page 13.


Unimplemented IMPDEP1 and IMPDEP2 opcodes encountered during execution
cause an illegal_instruction trap.


12.5.8 Performance Instrumentation


UltraSPARC T1 performance instrumentation is described in Performance Control
Register on page 49 and SPARC Performance Instrumentation Counter on page 51.


APPENDIX A


Assembly Language Syntax 


The assembly language syntax used in this document follows that described in the 
"Assembly Language Syntax" appendix of the UltraSPARC Architecture 2005 
specification.


APPENDIX B


Programming Guidelines 





B.1 Multithreading 


In UltraSPARC T1, execution is switched in round-robin fashion every cycle among 
the strands that are ready to issue another instruction. Context switching is built into 
the UltraSPARC T1 pipeline and takes place during the SWITCH stage, thus contexts 
are switched each cycle with no pipeline stall penalty. 


The following instructions change a strand from a ready-to-issue state to a not- 
ready-to-issue state, until hardware determines that their input/execution 
requirements can be satisfied: 


All branches (including CALL, JMPL, etc.) 

All VIS instructions 

All floating point (FPops) 

All WRPR, WR 

All RDPR, RD 

SAVE(D), RESTORE(D), RETURN, FLUSHW (all register management) 
All MUL and DIV 

MULSCC 

MEMBAR #Sync, MEMBAR #StoreLoad, MEMBAR #MemIssue 
FLUSH 

All loads 

All floating-point memory operations 

All memory operations to alternate space 

All atomic load-store operations

Prefetch 
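

To illustrate the round-robin switching described at the start of this section, here is a toy model of per-cycle strand selection. It is a sketch under stated assumptions, not a description of the hardware: it assumes four strands per core and a strict rotating priority that starts after the last strand that issued, and the function name and the per-cycle "ready" sets are hypothetical inputs.

```python
def issue_order(ready_by_cycle, n_strands=4):
    """Pick one ready strand per cycle in rotating order.

    ready_by_cycle: iterable of sets holding the strand IDs that are
    ready-to-issue in each cycle. Returns the strand chosen per cycle
    (None when no strand is ready).
    """
    last = n_strands - 1          # so that strand 0 has priority first
    order = []
    for ready in ready_by_cycle:
        pick = None
        for offset in range(1, n_strands + 1):
            candidate = (last + offset) % n_strands
            if candidate in ready:
                pick = candidate
                break
        order.append(pick)
        if pick is not None:
            last = pick
    return order


# Strand 1 becomes not-ready for two cycles (for example, after a load).
cycles = [{0, 1, 2, 3}, {0, 2, 3}, {0, 2, 3}, {0, 1, 2, 3}, {0, 1, 2, 3}]
order = issue_order(cycles)
```

In this run the not-ready strand is simply skipped, so the other strands keep the pipeline busy while strand 1 waits.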







B.2 Pipeline Strand Flush


The front end of the UltraSPARC T1 pipeline prevents instructions from being issued 
to the rest of the pipeline unless there is a high probability (for most instructions, a 
probability of 1.0) of the instruction having all its input dependencies satisfied. For 
certain instructions, the input dependencies cannot be determined by the front end, 
and the instruction (and any subsequent instructions issued from that strand) need 
to be flushed from the pipeline and replayed. TABLE B-1 lists instructions that may 
end up causing a strand flush. 


TABLE B-1 Pipeline Strand Flush Events


Event                    Strand Flush Description

Loads                    The strand will be flushed if the load encounters a cache miss
                         while executing with STRAND_STS_REG.spec_en = 1.

multiply/divide/         Resource contention can cause a strand flush.
floating-point operate

store buffer full        The strand will be flushed until space is available in the
                         store buffer.

trap                     Instruction and any subsequent instructions in the pipeline
                         from that strand are flushed, and fetching restarts at the
                         trap vector.





B.3 Instruction Latencies


TABLE B-2 lists the single-strand instruction latencies for UltraSPARC T1. When
multiple strands are executing, much of the additional latency for multicycle
instructions will be overlapped with execution of the additional strands.


In this table, certain opcodes are marked with mnemonic superscripts. These
superscripts and their meanings are defined in TABLE 5-1 on page 13.


TABLE B-2 Instruction Latencies


Instruction Latency Comments
ADD 1
ADDC 1
ADDcc 1
ADDCcc 1
AND 1 
ANDcc 1 
ANDN 1 
ANDNcc 1 
BA 3? 
BA_A 4 
BA_A_PN 4? 
BA_PN 3 
BA_XCC 3 
BA_XCC_A 4? 
BA_XCC_A_PN 4 
BA_XCC_PN 3? 
BCC 3 
BCC_A 3 
BCC_A_PN 4 
BCC_PN 3 
BCC_XCC 3 
BCC_XCC_A 3 
BCC_XCC_A_PN 4 
BCC_XCC_PN 3? 
BCS 3 
BCS_A 4 
BCS_A_PN 4? 
BCS_PN 3 
BCS_XCC 3 
BCS_XCC_A 4 
BCS_XCC_A_PN 4 
BCS_XCC_PN 4? 
BE 3 
BE_A 4?
BE_A_PN 4
BE_PN 3
BE_XCC 3
BE_XCC_A 4
BE_XCC_A_PN 4
BE_XCC_PN 3
BG 3
BG_A 3
BG_A_PN 4
BG_PN 3
BG_XCC 3
BG_XCC_A 3
BG_XCC_A_PN 3?
BG_XCC_PN 3
BGE 3
BGE_A 3
BGE_A_PN 4?
BGE_PN 3
BGE_XCC 3
BGE_XCC_A 4
BGE_XCC_A_PN 4?
BGE_XCC_PN 3
BGU 3
BGU_A 3
BGU_A_PN 3?
BGU_PN 3
BGU_XCC 3
BGU_XCC_A 4
BGU_XCC_A_PN 4
BGU_XCC_PN 3
BL 3
BL_A 3
BL_A_PN 3?
BL_PN 3
BL_XCC 3
BL_XCC_A 4?
BL_XCC_A_PN 4
BL_XCC_PN 3
BLE 9
BLE_A 3
BLE_A_PN 3?
BLE_PN 3
BLE_XCC 3
BLE_XCC_A 4
BLE_XCC_A_PN 4?
BLE_XCC_PN 3
BLEU 3
BLEU_A 4?
BLEU_A_PN 4
BLEU_PN 3?
BLEU_XCC 3
BLEU_XCC_A 4
BLEU_XCC_A_PN 4
BLEU_XCC_PN 3
BN 3
BN_A 4
BN_A_PN 4
BN_PN 3
BN_XCC 3
BN_XCC_A
BN_XCC_A_PN 4
BN_XCC_PN 3
BNE 3
BNE_A 3
BNE_A_PN 3?
BNE_PN 3
BNE_XCC 3?
BNE_XCC_A 4
BNE_XCC_A_PN 4
BNE_XCC_PN 3?
BNEG 3
BNEG_A 4
BNEG_A_PN 4
BNEG_PN 3
BNEG_XCC 3
BNEG_XCC_A 4
BNEG_XCC_A_PN 4
BNEG_XCC_PN 3
BPOS 3
BPOS_A 4?
BPOS_A_PN 4
BPOS_PN 3
BPOS_XCC 3
BPOS_XCC_A 4?
BPOS_XCC_A_PN 4
BPOS_XCC_PN 3
BRGEZ 3
BRGEZ_A 4?
BRGEZ_A_PN 4
BRGEZ_PN 3
BRGZ
BRGZ_A
BRGZ_A_PN
BRGZ_PN
BRLEZ
BRLEZ_A
BRLEZ_A_PN
BRLEZ_PN
BRLZ
BRLZ_A
BRLZ_A_PN
BRLZ_PN
BRNZ
BRNZ_A
BRNZ_A_PN
BRNZ_PN
BRZ
BRZ_A
BRZ_A_PN
BRZ_PN
BVC
BVC_A
BVC_A_PN
BVC_PN
BVC_XCC
BVC_XCC_A
BVC_XCC_A_PN
BVC_XCC_PN
BVS
BVS_A
[Latency values for BRGZ through BVS_A are illegible in the source; only "3" and "4?" survive and cannot be matched to rows.]
BVS_A_PN 4
BVS_PN 3
BVS_XCC 3
BVS_XCC_A 4
BVS_XCC_A_PN 4
BVS_XCC_PN 3

CASAPast 39 performed in L2 

CASXAPast 39 performed in L2 

FABSd 8 

FABSs 21 

FADDd 26 

FADDs 26 

FBA 3 
FBA_A 4? 
FBA_A_PN 4 
FBA_PN 3 
FBE 3 
FBE_A 4 
FBE_A_PN 4 
FBE_PN 3 

FBG 3 
FBG_A 4 
FBG_A_PN 4? 
FBG_PN 3 

FBGE 3 
FBGE_A 4 
FBGE_A_PN 4 
FBGE_PN 3 

FBL 3 
FBL_A 4?
FBL_A_PN
FBL_PN
FBLE
FBLE_A
FBLE_A_PN
FBLE_PN
FBLG
FBLG_A
FBLG_A_PN
FBLG_PN
FBN
FBN_A
FBN_A_PN
FBN_PN
FBNE
FBNE_A
FBNE_A_PN
FBNE_PN
FBUE
FBUE_A
FBUE_A_PN
FBUE_PN
FBUG
FBUG_A
FBUG_A_PN
FBUG_PN
FBUGE
FBUGE_A
FBUGE_A_PN
FBUGE_PN
[Latency values for FBL_A_PN through FBUGE_PN are illegible in the source.]
FBUL 3 
FBUL A 4 
FBUL A PN 4 
FBUL PN 3 
FBULE 3 
FBULE A 4 
FBULE A PN 4 
FBULE PN 3 
FDIVd 83 
FDIVs 54 
FdTOi 25 
FdTOs 25 
FdTOx 25 
FiTOd 25 
FiTOs 26 





[Rows between FiTOs and FMOVDUL, ending with FMOVDUGE, are illegible in the source.]
FMOVDUL 8 
FMOVDULE 8 
FMOVs 8 
FMOVSA 8 
FMOVSE 8? 
FMOVSG 8? 
FMOVSGE 8? 
FMOVSL 8? 
FMOVSLE 8? 
FMOVSLG 8? 
FMOVSN 8 
FMOVSNE 8 
FMOVSO 8? 
FMOVSU 8 
FMOVSUE 8 
FMOVSUG 8 
FMOVSUGE 8 
FMOVSUL 8 
FMOVSULE 8 
FMULd 29 
FMULs 29 
FNEGd 8 
FNEGs 8 
FsMULd 29 
FsTOd 25 
FsTOi 25 
FsTOx 25 
FSUBd 26 
FSUBs 26 
FxTOd 26
FxTOs 26 
LD FP 9 
LDFSR 9 
LDD FP 9? 
LDD FPD 9 
LDDFAas! 9 
LDSB 22 performed in L2 
LDSBA 21 performed in L2 
LDSH 3 
LDSHAPs 3 
LDSTUB 37 performed in L2 
LDSTUBAPs 37 performed in L2 
LDSW 3 
LDSWA 3 
LDUB 3 
LDUBA 3 
LDUH 3 
LDUHAPs 3 
LDUW 3 
LDUWAPs 3 
LDX 3 
LDX FSR 27 
LDXAPAS 3 
MOVA 1
MOVA_FCC 1
MOVCC 1
MOVCS 1
MOVE 1
MOVE_FCC 1
MOVG 1?
MOVG_FCC
MOVGE
MOVGE_FCC
MOVGU
MOVL
MOVL_FCC
MOVLE
MOVLE_FCC
MOVLEU
MOVLG_FCC
MOVN
MOVN_FCC
MOVNE
MOVNE_FCC
MOVNEG
MOVO_FCC
MOVPOS
MOVRE
MOVRGEZ
MOVRGZ
MOVRLEZ
MOVRLZ
MOVRNE
MOVU_FCC
MOVUE_FCC
MOVUG_FCC
MOVUGE_FCC
MOVUL_FCC
MOVULE_FCC
MOVVC
[Latency values for MOVG_FCC through MOVVC are largely illegible in the source; only four values (1, 1, 1, 1?) survive and cannot be matched to rows.]
MOVVS 1 
MULSCC 7 
MULX 11 
OR 1 
ORcc 1 
ORN 1 
ORNcc 1 
RD CCR 4 
RDASI 4 
RD FPRS 4 
RD Y 4 
SDIVP 72 
SDIVccP? 72 
SDIVX 72 
SETHI 1 
SLL 1 
SLLX 1 
SMULP 1 
SMULccP 1 
SRA 1? 
SRAX 1 
SRL 1 
SRLX 1 
STFA Fast 8 
ST FSR 8 
STFAP^s 8 
STB 1 
STBA 4
STDF 8
STDA FP 8?

STDA FP ASI 8 

STH 1 

STHA 4 

STW 1? 

STWA 1? 

STX 1 

STX FSR 8 

STXA 1-? (4-7)? varies, depending on ASI 
SUB 1 

SUBC 1 

SUBcc 1 

SUBCcc 1 

SWAPP 49 performed in L2 
SWAPAD Past 37 performed in L2 
TADDcc 1? 

TADDccTVP 1 

TSUBcc 1 

TSUBccTVP 1 

UDIVP 72 

UDIVccP 72? 

UDIVX 72 

UMULP 11 

UMULccP 11 

WR CCR 9 

WRASI 9 

WR FPRS 9 

XNOR 1 

XNORcc 1 

XOR 1 


XORcc 1 





B.4 Grouping Rules


Each physical core in UltraSPARC T1 is single-issue, so there are no grouping
rules for UltraSPARC T1.





B.5 Floating-Point Operations


UltraSPARC T1 supports hardware floating-point operations, but since one floating-point
unit (FPU) is shared among 8 physical cores (32 strands), there are limitations
on dispatch of floating-point instructions to the FPU. Each physical processor core
(four strands) can have a single floating-point instruction outstanding at any given
time. For the purpose of this restriction, floating-point instructions include
floating-point operations, VIS floating-point operations, floating-point loads and stores, and
block loads and stores.





B.6 Synchronization


UltraSPARC T1 has two varieties of instructions for synchronization: memory
barriers and flush. Each of the following memory barrier instructions ensures that any
load, store, or atomic memory operation issued after it takes effect after all memory
operations issued before it:


■ MEMBAR with mmask{1} = 1 (MEMBAR #StoreLoad)
■ MEMBAR with cmask{1} = 1 (MEMBAR #MemIssue)
■ MEMBAR with cmask{2} = 1 (MEMBAR #Sync)


All other types of MEMBAR instructions are treated as NOPs, since they are implied
by the TSO memory ordering protocol followed by UltraSPARC T1.
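

The MEMBAR rule above can be written as a one-line predicate on the mask fields. This is a small sketch with a hypothetical function name; the bit positions for #StoreLoad, #MemIssue, and #Sync are the ones listed above, and the names of the remaining bits (#LoadLoad, #LoadStore, #StoreStore, #Lookaside) follow the SPARC V9 definition.

```python
# mmask bits: 0 = #LoadLoad, 1 = #StoreLoad, 2 = #LoadStore, 3 = #StoreStore
# cmask bits: 0 = #Lookaside, 1 = #MemIssue, 2 = #Sync
MMASK_STORELOAD = 1 << 1
CMASK_MEMISSUE = 1 << 1
CMASK_SYNC = 1 << 2


def membar_acts_as_barrier(mmask, cmask):
    """True if this MEMBAR orders memory operations on UltraSPARC T1;
    False if it is treated as a NOP because TSO already implies it."""
    return bool((mmask & MMASK_STORELOAD) or
                (cmask & (CMASK_MEMISSUE | CMASK_SYNC)))
```

For instance, a MEMBAR #StoreStore alone (mmask = 8) evaluates to False, while MEMBAR #Sync (cmask = 4) evaluates to True.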


However, the memory barriers do not guarantee that the instruction caches on 
UltraSPARC T1 have become consistent with the preceding memory operations. A 
FLUSH instruction guarantees that in addition to the preceding memory operations 
taking effect in the global memory system, all the instruction caches on UltraSPARC 
T1 are consistent with these operations. It also ensures that the instruction fetch 
buffer for the strand issuing the flush has become consistent with the preceding 
memory operations.


Thus, when one strand is modifying the instructions of another, the "producer"
strand should


1. Complete all necessary modifications
2. Issue a FLUSH
3. Signal completion to the "consumer" strand


Completion may be signalled by a store/atomic instruction which modifies a
predetermined location, or by issuing an interrupt to the consumer strand.


The consumer strand at this point should make sure that its instruction fetch buffer
(of size 2 entries) becomes consistent with the global memory system. This can be
done by any one of the following:


1. Issuing a branch to the modified location; or
2. Issuing a FLUSH instruction; or
3. Waiting for a two-instruction gap, to allow the two instructions in the fetch buffer
to drain.


In the case of a branch, the delay slot is not guaranteed to be consistent with global 
memory. The branch is a better option than flush for high performance. 


Note that the HALT instruction is not meant to be a synchronization instruction and
should not be used as such. For example, the following code, which uses HALT to
make sure func_A executes before func_B, may cause T0 to hang:


    T0              T1

    st X            ld X
    halt            do func_A
    do func_B       intr T0
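

One way to see why this sequence can hang is to enumerate the possible orderings of the two strands. The sketch below rests on assumptions that the text does not state: that T1's load only completes after T0's store, and that an interrupt delivered before T0 reaches halt is consumed at that point and does not wake a later halt. All names are illustrative.

```python
from itertools import combinations

T0 = ["st X", "halt"]                    # T0 steps up to the halt
T1 = ["ld X", "do func_A", "intr T0"]    # T1 steps


def interleavings(a, b):
    """Yield every merge of a and b that preserves each strand's order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        it_a, it_b = iter(a), iter(b)
        yield [next(it_a) if i in slots else next(it_b) for i in range(n)]


def hangs(order):
    """Assumed model: an interrupt that arrives before the halt is lost,
    so T0 halts afterwards with nothing left to wake it."""
    return order.index("intr T0") < order.index("halt")


# Assumption: T1's load observes the store, so "st X" precedes "ld X".
valid = [o for o in interleavings(T0, T1)
         if o.index("st X") < o.index("ld X")]
hanging = [o for o in valid if hangs(o)]
```

Under this model, the orderings in which T1 finishes func_A and sends the interrupt before T0 executes halt are the ones that leave T0 halted, which is why an explicit synchronization mechanism is needed instead.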
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APPENDIX C 


Opcode Maps 


This appendix contains the UltraSPARC T1 instruction opcode maps.


Opcodes marked with a dash (—) are reserved; an attempt to execute a reserved
opcode causes a trap unless the opcode is an implementation-specific extension to
the instruction set.


In this appendix, certain opcodes are marked with mnemonic superscripts. These
superscripts and their meanings are defined in TABLE 5-1 on page 13.


In the tables in this appendix, reserved (—) and shaded entries indicate opcodes that
are not implemented in the UltraSPARC T1 processor.





Shading      Meaning

[shade 1]    An attempt to execute opcode will cause an illegal_instruction exception.
[shade 2]    An attempt to execute opcode will cause an fp_exception_other exception
             with FSR.ftt = 3 (unimplemented_FPop).


TABLE C-1 op{1:0}

op{1:0}   Instruction Class              Reference

0         Branches and SETHI             See TABLE C-2
1         CALL
2         Arithmetic and Miscellaneous   See TABLE C-3
3         Loads/Stores                   See TABLE C-4


TABLE C-2 op2{2:0} (op = 0)

op2{2:0}   Instruction

0          ILLTRAP
1          BPcc (see TABLE C-7)
2          BiccP (see TABLE C-7)
3          BPr (see TABLE C-8)
4          SETHI, NOP†
5          FBPfcc (see TABLE C-7)
6          FBfccP (see TABLE C-7)
7          —

† rd = 0, imm22 = 0


The ILLTRAP and reserved (—) encodings generate an illegal_instruction trap.


TABLE C-3 op3{5:0} (op = 2) (1 of 3)





op3 {5:4} 





op3 
(3:0) 


0 ADD ADDcc TADDcc 


TADDccTVP 


1 


WRYP (rd = 0) 

— (rd=1) 

WRCCR (rd = 2) 

WRASI (rd = 3) 

— (rd = 4,5) 

— (rd 25, rd 2 0, rd = 1) 
WRFPRS (rd = 6)

— (rd - 7-14) 

— (rd = 15 and (rs1 > 0 ori = 0)) 
WRPCR" (rd = 16, rd = 0) 
— (rd =16, rd = 1) 

WRPIC (rd = 17, rd = 0) 

— (rd = 17, rd 2 1) 

(rd = 18) 

WRGSR (rd = 19) 
WRSOFTINT SET (rd = 0) 
WRSOFTINT CLR" (rd = 21) 
WRSOFTINT (rd = 22) 
WRTICK_CMPR? (rd = 23) 
WRSTICK CMPR (rd = 25) 
WR $asr26 (rd=26,rd=1) 
— (rd = 26, rd = 0) 
(rd=27-31)) 

SAVED? (fcn = 0), 


RESTORED? (fcn = 1) 
— (fcn » 1) 


WRPR 
— (rd = 15, 17-31) 








TSUBccTVP 
SUBcc MULScc? 


— (rs1 = 2, 4, 7-30) 
FPop1 - See TABLE C-5 





ANDN ANDNcc SLL (x = 0), SLLX (x = 1) 


ORN ORNcc SRL (x = 0), SRLX (x = 1) 


XNOR XNORcc SRA (x = 0), SRAX א)‎ = 1) 








FPop2 - See TABLE C-6 
IMPDEP1 (VIS) - See TABLE C-12 
IMPDEP2 





94 UItraSPARC T1 Supplement * Draft D2.0, 17 Mar 2006 











TABLE C-3 op3{5:0} (op = 2) (2 of 3)
op3 {5:4} 
0 1 2 
8 lADDC ADDCcc RDYP (rs1 =0, i = 0(( JMPL 

— (rst =0,i=1) 
— (rsi=1) 
RDCCR (rst=2,i=0)) 
— (rst =2,i=1) 
RDASI (rs1=3, i = 0) 
— (rst =3,i=1) 
RDTICKP"* (rs1 = 4,i=0) 
— (rst =4,i=1) 
RDPC (rst = 5,i=0) 
— (rst =5,i=1) 
RDFPRS (rs1 = 6,i=0) 
— (rst =6,i=1) 
— (rs1 = 7-14) 
MEMBAR (rs1 = 15,rd=0,i = 1) 
STBARP (rs1 = 15,rd=0,i = 0) 

op3 — 081 = 15,rd > 0) 

{3:0} RDPCRP (rs1= 16) 








RDPIC (rs1=17) 
— (rs1 = 18) 

RDGSR (rs1- 19, i = 0) 

— (rs1 =19,i=1) 

— (rs1=20, 21) 
RDSOFTINTP (rst = 22, i = 0) 





— (rs1 =22,i=1) 

RDTICK CMPRP (rs1=23, i = 0) 

— (rs1 =23,i=1) 

RDSTICK? (rs1 = 24, i = 0) 

— (rst =24,i= 1) 

RDSTICK CMPRP (rs1=25, i= 0) 
(rs1 = 25,1 = 1) 

rd %asr26 (rs1- 26) 

— (rs1 = 27-31)


TABLE C-3 op3{5:0} (op = 2) (3 of 3)


















op3 {5:4} 








2 3 
— (r$1 = 2, 4, 7 - 30) RETURN 
Tec {(i = 0 and inst{10:5} = 0) or 
— (rs1 = 15, 17 - 30) i = 1 and inst{10:8} = 0)]-See 


TABLE C-7 and TABLE C-11. 
— (((i = 0 and inst{10:5} > 0) or 
i = 1 and inst{10:8} > 0)} 


FLUSH 








SUBCcc 





RESTORE 
DONE (fen = 0) 
RETRYP (fcn = 1) 
— (fcn » 1) 


MOVcc See TABLE C-9 
SDIVX 


POPC (rs1 = 0) 
— (rs1 » 0) 









SDIVccP MOVr See TABLE C-8 











Shaded and the reserved (—) opcodes cause an illegal_instruction trap.


TABLE C-4 op3{5:0} (op = 3)
op3{5:4} 

0 1 2 3 
0 LDUW LDUWAPes א תפז‎ 
1 LDUB LDUBAPAS — 
2 LDUH LDUHAPs LDQFAPAs! 
3 LDDP LDDAP: Pas LDDFAP^s 

— (rd odd) — (rd odd) See 8.6.4 XREF 
4 STW STWAP^s STFAP^s 
5 STB STBAPast STFSRP, STXFSR — 
— (rd » 1) 

(3:0) — (rd odd) — (rd odd) See 8.6.4 XREF 
s [ross we | -— | = 
* [os mm |  - | ב‎ 
A  ILDSH — 

B [LDX EN 

6 PES 

D  |LDSTUB LDSTUBAP^s PREFETCH PREFETCHAPAS 
— (fcn = 5-15) — (fcn = 5-15) 

E STX STXAP^s — CASXAP^s 

ZEE אא‎ | -— [ — 





LDQF, LDQFA, STQF, STQFA, and the reserved (—) opcodes cause an
illegal_instruction trap.





TABLE C-5 opf{8:0} (op = 2, op3 = 34₁₆ = FPop1)


opf{2:0} 
opf{8:3} 3 4 


me [rov [uova ג‎ rer | 
0116 FABSs FABSd FABSq 
0316 
0446 
0546 FSQRTs |FSORTd |FSORTq 
0616 — — — 
0746 — — — 
0816 
0916 
16 
06 
06 
0% 
0% 
6 
1016 
11416 
1246 
1346 
1446 
1516 
1646 
1716 
1846 
1946 
1A16 
1-46 - - - - - - - - 










































































Shaded and reserved (—) opcodes cause an fp_exception_other trap with FSR.ftt = 3
(unimplemented_FPop).


TABLE C-6 opf{8:0} (op = 2, op3 = 35₁₆ = FPop2)





opf{3:0} 




















— |FMOVs )1001(| FMOVd )1001( — |FMOVsLEZ |FMOVdLEZ |FMOVqLEZ 
— |FCMPs FCMPd — |FCMPEs FCMPEd FCMPEq 


|; = 3j — |FMOVsLZ |FMOVdLZ |FMOVqLZ 














FMOVsNZ |FMOVdNZ |FMOVqNZ 








0 
0 
0 
0 
0 
0 





— |FMOVs )1003(| FMOVd (fcc3) |FMOVq )1003( | — 





FMOVsGZ |FMOVdGZ |FMOVqGZ 











LL—— — |FMOVsGEZ |FMOVdGEZ |FMOVqGEZ 





7 
8 
9 
A 
0 
פ‎ 
10 


|FMOVs (ice) |FMOVd (icc) |FMOVq (icc)‏ — ה 
— — | 11-17 


| 18 | — |FMOVs (xcc) |FMOVd (xcc) |FMOVq (xco) 


19-1F | — 




















t Reserved variation of FMOVr 


Shaded and reserved (—) opcodes cause an fp_exception_other trap with FSR.ftt = 3
(unimplemented_FPop).





TABLE C-7 cond{3:0}





FBPfcc 
































TABLE C-8 Encoding of rcond{2:0} Instruction Field


rcond{2:0}   BPr        MOVr          FMOVr
             op = 0     op = 2        op = 2
             op2 = 3    op3 = 2F₁₆    op3 = 35₁₆

0            —          —             —
1            BRZ        MOVRZ         FMOVRZ
2            BRLEZ      MOVRLEZ       FMOVRLEZ
3            BRLZ       MOVRLZ        FMOVRLZ
4            —          —             —
5            BRNZ       MOVRNZ        FMOVRNZ
6            BRGZ       MOVRGZ        FMOVRGZ
7            BRGEZ      MOVRGEZ       FMOVRGEZ
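

The rcond encodings in TABLE C-8 can be read as predicates on a signed 64-bit register value. The sketch below is illustrative: the comparison attached to each encoding is inferred from the mnemonic suffix (Z, LEZ, LZ, NZ, GZ, GEZ), and the helper names are hypothetical.

```python
# rcond{2:0} -> (mnemonic suffix, predicate on a signed register value).
# Encodings 0 and 4 are reserved in TABLE C-8.
RCOND = {
    1: ("Z",   lambda r: r == 0),
    2: ("LEZ", lambda r: r <= 0),
    3: ("LZ",  lambda r: r < 0),
    5: ("NZ",  lambda r: r != 0),
    6: ("GZ",  lambda r: r > 0),
    7: ("GEZ", lambda r: r >= 0),
}


def rcond_holds(rcond, reg_value):
    """Evaluate a register condition; reserved encodings raise ValueError."""
    if rcond not in RCOND:
        raise ValueError("reserved rcond encoding")
    return RCOND[rcond][1](reg_value)


def mnemonic(prefix, rcond):
    """Build a mnemonic such as BRZ, MOVRLEZ, or FMOVRGZ from its prefix."""
    return prefix + RCOND[rcond][0]
```

For example, mnemonic("BR", 1) gives "BRZ" and mnemonic("MOVR", 7) gives "MOVRGEZ", matching the table rows.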



































TABLE C-9 cc / opf_cc Fields (MOVcc and FMOVcc)

opf_cc
cc2   cc1   cc0   Condition Code Selected
0     0     0     fcc0
0     0     1     fcc1
0     1     0     fcc2
0     1     1     fcc3
1     0     0     icc
1     0     1     —
1     1     0     xcc
1     1     1     —














TABLE C-10 cc Fields (FBPfcc, FCMP, and FCMPE)


cc1   cc0   Condition Code Selected
0     0     fcc0
0     1     fcc1
1     0     fcc2
1     1     fcc3


TABLE C-11 cc Fields (BPcc and Tcc)

cc1   cc0   Condition Code Selected
0     0     icc
0     1     —
1     0     xcc
1     1     —


TABLE C-12 VIS Opcodes op = 2, op3 = 36₁₆ = IMPDEP1
























































opf{3:0}
0 1 2 3 4 5 6 7
opf{8:4}
00 |EDGE8 EDGE8N EDGE8L EDGE8LN |EDGE16 EDGE16N |EDGE16L EDGE16LN
01 |ARRAY8 ARRAY16 ARRAY32
02 |FCMPLE16 FCMPNE16 FCMPLE32 FCMPNE32
03 FMUL8X16 FMUL8X16AU FMUL8X16AL |FMUL8SUX16 |FMUL8ULX16
04
05 |FPADD16 FPADD16S FPADD32 |FPADD32S |FPSUB16 FPSUB16S |FPSUB32 FPSUB32S
06 |FZERO FZEROS FNOR FNORS FANDNOT2 |FANDNOT2S |FNOT2 FNOT2S
07 |FAND FANDS FXNOR FXNORS FSRC1 FSRC1S FORNOT2 |FORNOT2S
08 |SHUTDOWN |SIAM
09..
1F
opf{3:0}
8 9 A B C D E F
opf{8:4}
00 |EDGE32 EDGE32N EDGE32L |EDGE32LN
01 |ALIGNADDRESS BMASK ALIGNADDRESS_LITTLE
02 |FCMPGT16 FCMPEQ16 FCMPGT32 FCMPEQ32
03 FMULD8SUX16 FMULD8ULX16 FPACK32 FPACK16 FPACKFIX |PDIST
04 |FALIGNDATA FPMERGE |BSHUFFLE |FEXPAND
05
06 |FANDNOT1 |FANDNOT1S |FNOT1 FNOT1S FXOR FXORS FNAND FNANDS
07 |FSRC2 FSRC2S FORNOT1 |FORNOT1S |FOR FORS FONE FONES
08
09..
1F





























Note | An illegal_instruction exception is generated if the undefined or
shaded opcodes in the IMPDEP1 space are used.


APPENDIX D


Instructions and Exceptions 


The instructions supported by UltraSPARC T1 and the exceptions they generate are 
listed in the UltraSPARC Architecture 2005 specification.


APPENDIX E


IEEE 754 Floating Point Support 


UltraSPARC T1 conforms to the SPARC V9 Appendix B (IEEE Std 754-1985 
Requirements for SPARC-V9) recommendations. 


Note | UltraSPARC T1 detects tininess before rounding.





E.1 Special Operand Handling


The UltraSPARC T1 FPU provides full hardware support for subnormal operands
and results. Unlike UltraSPARC I/II and UltraSPARC III, UltraSPARC T1 will never
generate an unfinished_FPop trap type. Also, unlike UltraSPARC I/II and
UltraSPARC III, UltraSPARC T1 does not implement a nonstandard floating-point
mode. The ns bit of the FSR is always read as 0, and writes to it are ignored.


The FPU generates +inf, -inf, +largest number, or -largest number (depending on
round mode) for overflow cases for multiply, divide, and add operations.
For higher-to-lower precision conversion instructions {FDTOS}:

■ Overflow, underflow, and inexact exceptions can be raised.

■ Overflow is treated the same way as an unrounded add result; depending on
the round mode, the FPU generates either the properly signed infinity or the
properly signed largest number.

■ Underflow produces a signed zero, smallest number, or subnormal result.
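
The dependence of the overflow result on round mode can be sketched as a small Python model. This is an illustration of the IEEE 754 rule only, not a description of the T1 datapath; the function name and the round-mode labels ("nearest", "zero", "+inf", "-inf") are hypothetical names chosen for this sketch.

```python
import math

# Largest finite magnitudes for IEEE 754 single and double precision.
LARGEST = {
    "single": (2 - 2**-23) * 2.0**127,
    "double": (2 - 2**-52) * 2.0**1023,
}

def overflow_result(negative, round_mode, fmt="double"):
    """Value delivered on overflow when the overflow trap is disabled.

    round_mode is one of "nearest", "zero", "+inf", "-inf"
    (hypothetical labels for the four rounding directions).
    """
    if round_mode == "nearest":
        to_infinity = True
    elif round_mode == "zero":
        to_infinity = False
    elif round_mode == "+inf":
        to_infinity = not negative   # only positive results round up to +inf
    elif round_mode == "-inf":
        to_infinity = negative       # only negative results round down to -inf
    else:
        raise ValueError(round_mode)
    magnitude = math.inf if to_infinity else LARGEST[fmt]
    return -magnitude if negative else magnitude
```

For example, a negative overflow in round-to-+infinity mode yields the negative largest number rather than -inf.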


For conversion-to-integer instructions {F(s,d)TOi, F(s,d)TOx}: UltraSPARC T1 follows
SPARC V9 Appendix B.5, p. 246.


For NaNs: UltraSPARC T1 follows SPARC V9 Appendix B.2 (particularly Table 27)
and B.5, pp. 244-246.


■ Please note that Appendix B applies to those instructions listed in IEEE 754
section 5: “All conforming implementations of this standard shall provide
operations to add, subtract, multiply, divide, extract the sqrt, find the remainder,
round to integer in fp format, convert between different fp formats, convert
between fp and integer formats, convert binary<->decimal, and compare.
Whether copying without change of format is considered an operation is an
implementation option.”


■ The instructions involving copying/moving of fp data (FMOV, FABS, and FNEG)
follow earlier UltraSPARC implementations by doing the appropriate sign bit
transformation, but they neither cause an invalid exception nor perform an
rs2 = SNaN to rd = QNaN transformation.


■ Following UltraSPARC I/II implementations, all FPops as defined in V9
update cexc. All other instructions leave cexc unchanged.
■ Following SPARC V9 Manual 5.1.7.6, 5.1.7.8, 5.1.7.9, and figures in 5.1.7.10,
Overflow Result is defined as:

   If the appropriate trap enable masks are not set (FSR.ofm = 0 and
   FSR.nxm = 0), then set aexc and cexc overflow and inexact flags:
   FSR.ofa = 1, FSR.nxa = 1, FSR.ofc = 1, FSR.nxc = 1. No trap is generated.

   If either or both of the appropriate trap enable masks are set (FSR.ofm = 1 or
   FSR.nxm = 1), then only an IEEE overflow trap is generated: FSR.ftt = 1.
   The particular cexc bit that is set diverges from UltraSPARC I/II to follow
   the SPARC V9 section 5.1.7.9 errata:

      If FSR.ofm = 0 and FSR.nxm = 1, then FSR.nxc = 1.

      If FSR.ofm = 1, independent of FSR.nxm, then FSR.ofc = 1 and
      FSR.nxc = 0.


■ Following SPARC V9 Manual 5.1.7.6, 5.1.7.8, 5.1.7.9, and figures in 5.1.7.10,
Underflow Result is defined as:

   If the appropriate trap enable masks are not set (FSR.ufm = 0 and
   FSR.nxm = 0), then set aexc and cexc underflow and inexact flags:
   FSR.ufa = 1, FSR.nxa = 1, FSR.ufc = 1, FSR.nxc = 1. No trap is generated.

   If either or both of the appropriate trap enable masks are set (FSR.ufm = 1 or
   FSR.nxm = 1), then only an IEEE underflow trap is generated: FSR.ftt = 1.
   The particular cexc bit that is set diverges from UltraSPARC I/II to follow
   the SPARC V9 section 5.1.7.9 errata:

      If FSR.ufm = 0 and FSR.nxm = 1, then FSR.nxc = 1.

      If FSR.ufm = 1, independent of FSR.nxm, then FSR.ufc = 1 and
      FSR.nxc = 0.
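
The overflow and underflow rules above share one shape, which the following Python sketch captures as a single function. It is a model for reasoning about the flag settings, not T1 code: the function name and the dictionary representation of the FSR are hypothetical, and leaving aexc unchanged when a trap is taken is an assumption taken from the general SPARC V9 trap behavior rather than from this section.

```python
def fsr_after_inexact_event(kind, fsr):
    """Model the FSR update for an overflow ("of") or underflow ("uf") result.

    fsr is a dict of 0/1 values keyed by FSR field names from the text
    (ofm, ufm, nxm, ofa, ufa, nxa, ...). Returns (new_fsr, trapped).
    """
    if kind not in ("of", "uf"):
        raise ValueError(kind)
    new = dict(fsr)
    # cexc describes only the current instruction, so clear the bits modeled here.
    for bit in ("ofc", "ufc", "nxc"):
        new[bit] = 0
    mask = fsr.get(kind + "m", 0)
    nxm = fsr.get("nxm", 0)
    if not mask and not nxm:
        # No trap: set the current and the accrued flags for both conditions.
        new[kind + "c"] = 1
        new["nxc"] = 1
        new[kind + "a"] = 1
        new["nxa"] = 1
        return new, False
    # Trap: ftt = 1, and exactly one cexc bit is set.
    # Assumption: aexc is left unchanged when the trap is taken.
    new["ftt"] = 1
    if mask:
        new[kind + "c"] = 1   # nxc stays 0, independent of nxm
    else:
        new["nxc"] = 1        # only the inexact mask was set
    return new, True
```

Calling `fsr_after_inexact_event("of", {"ofm": 0, "nxm": 1})` reports a trap with only nxc set, which is the case where the behavior diverges from UltraSPARC I/II.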


The remainder of this section gives examples of special cases to be aware of that 
could generate various exceptions. 


E.1.1 Infinity Arithmetic 


Let “num” be defined as unsigned in the following tables. 
E.1.1.1 One Infinity Operand Arithmetic


■ Do not generate exceptions


TABLE E-1 One Infinity Operations That Do Not Generate Exceptions 





Cases 





+inf plus +num = +inf
+inf plus -num = +inf
-inf plus +num = -inf
-inf plus -num = -inf


+inf minus +num = +inf 
+inf minus -num = +inf 
-inf minus +num = -inf 
-inf minus -num = -inf 


+inf multiplied by +num = +inf 
+inf multiplied by -num = -inf 
-inf multiplied by +num = -inf 
-inf multiplied by -num = +inf 


+inf divided by +num = +inf 
+inf divided by -num = -inf 
-inf divided by +num = -inf 
-inf divided by -num = +inf 


+num divided by +inf = +0 
+num divided by -inf = -0 
-num divided by +inf = -0 
-num divided by -inf = +0 





fstod, fdtos (+inf) = +inf 
fstod, fdtos (-inf) = -inf


+inf divided by +0 = +inf 
+inf divided by -0 = -inf 
-inf divided by +0 = -inf 
-inf divided by -0 = +inf 
Any arithmetic operation involving infinity as 1 operand and a QNaN as the other operand:
SPARC V9 B.2.2 Table 27

(+/- inf) OPERATOR (QNaN2) = QNaN2

(QNaN1) OPERATOR (+/- inf) = QNaN1


Compares when other operand is not a NaN treat infinity just like a regular number:
+inf = +inf, +inf > anything else;

-inf = -inf, -inf < anything else.

Affects following instructions:

V9 fp compares (rs1 and/or rs2 could be +/- inf):

* FCMPE

* FCMP


Compares when other operand is a QNaN, SPARC V9 A.13, B.2.1; fcc value = unordered = 3

fcmp(s/d) (+/- inf) with (QNaN2) - no invalid exception

fcmp(s/d) (QNaN1) with (+/- inf) - no invalid exception
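
Several rows of Table E-1 can be checked on any IEEE 754 host. The Python sketch below runs on the host's floating-point unit, not on UltraSPARC T1, so it illustrates the IEEE 754 rules the table restates rather than T1 itself; all variable and function names are hypothetical.

```python
import math

inf = math.inf

def sign_bit(x):
    """Return True if x has its sign bit set (distinguishes -0.0 from +0.0)."""
    return math.copysign(1.0, x) < 0

# Rows from Table E-1: the result stays infinite with the expected sign.
cases = [
    (inf + 5.0, inf),
    (inf + -5.0, inf),
    (-inf - 5.0, -inf),
    (inf * -5.0, -inf),
    (-inf / -5.0, inf),
]
all_match = all(got == want for got, want in cases)

# A finite number divided by infinity gives a correctly signed zero.
neg_zero = 5.0 / -inf
pos_zero = -5.0 / -inf

# Infinity combined with a quiet NaN passes the NaN through.
passes_nan = math.isnan(inf + math.nan)

# Ordered compares treat infinity like a number; a compare with a NaN is unordered.
ordered = inf > 1e308 and -inf < -1e308 and inf == inf
unordered = not (inf < math.nan or inf > math.nan or inf == math.nan)
```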


■ Could generate exceptions


TABLE E-2 One Infinity Operations That Could Generate Exceptions

Cases | Possible Exception | Result (in addition to accrued exception) if tem is cleared

SPARC V9 Appendix B.5 (footnote 1) | IEEE 754 7.1 |
F{s,d}TOi (+inf) = invalid | IEEE 754 invalid | 2^31 - 1
F{s,d}TOx (+inf) = invalid | IEEE 754 invalid | 2^63 - 1
F{s,d}TOi (-inf) = invalid | IEEE 754 invalid | -2^31
F{s,d}TOx (-inf) = invalid | IEEE 754 invalid | -2^63

SPARC V9 B.2.2 | IEEE 754 7.1 | (No NaN operand result)
+inf multiplied by +0 = invalid | IEEE 754 invalid | QNaN
+inf multiplied by -0 = invalid | IEEE 754 invalid | QNaN
-inf multiplied by +0 = invalid | IEEE 754 invalid | QNaN
-inf multiplied by -0 = invalid | IEEE 754 invalid | QNaN

SPARC V9 B.2.2 Table 27 (footnote 2): Any arithmetic operation involving infinity as 1 operand and a SNaN as the other operand except copying/moving data | IEEE 754 7.1 | (One operand, a SNaN)
(+/- inf) OPERATOR (SNaN2) | IEEE 754 invalid | QSNaN2
(SNaN1) OPERATOR (+/- inf) | IEEE 754 invalid | QSNaN1

SPARC V9 A.13, B.2.1 (footnote 2): Any compare operation involving infinity as 1 operand and a SNaN as the other operand | IEEE 754 7.1 |
FCMP(s/d) (+/- inf) with (SNaN2) | IEEE 754 invalid | fcc value = unordered = 3
FCMP(s/d) (SNaN1) with (+/- inf) | IEEE 754 invalid | fcc value = unordered = 3
FCMPE(s/d) (+/- inf) with (SNaN2) | IEEE 754 invalid | fcc value = unordered = 3
FCMPE(s/d) (SNaN1) with (+/- inf) | IEEE 754 invalid | fcc value = unordered = 3

SPARC V9 A.13 (footnote 2): Any compare & generate exception operation involving infinity as 1 operand and a QNaN as the other operand | IEEE 754 7.1 |
FCMPE(s/d) (+/- inf) with (QNaN2) | IEEE 754 invalid | fcc value = unordered = 3
FCMPE(s/d) (QNaN1) with (+/- inf) | IEEE 754 invalid | fcc value = unordered = 3



1. Similar invalid exceptions also included in SPARC V9 B.5 are generated when the source operand is a NaN (QNaN or SNaN) or a
resulting number that cannot fit in 32b [64b] integer format: large positive argument >= 2^31 [2^63] or large negative argument
<= -(2^31 + 1) [-(2^63 + 1)].


2. Note that in the IEEE 754 standard, infinity is an exact number, so this exception could also apply to non-infinity operands.
Also note that the invalid exception and SNaN to QNaN transformation do not apply to copying/moving FPops (fmov, fabs, fneg).
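
The saturating results of the floating-point-to-integer conversions can be summarized in a short Python model. This is a sketch under stated assumptions: the function name and its (result, invalid) return convention are hypothetical, the conversion is modeled as truncation toward zero, and the NaN results follow the sign of the NaN operand.

```python
import math

def fp_to_int(value, bits):
    """Model F{s,d}TOi (bits=32) or F{s,d}TOx (bits=64) with the invalid trap disabled.

    Returns (result, invalid). The conversion truncates toward zero.
    """
    max_int = 2 ** (bits - 1) - 1
    min_int = -(2 ** (bits - 1))
    if math.isnan(value):
        # The sign of the NaN selects the saturated result.
        negative = math.copysign(1.0, value) < 0
        return (min_int if negative else max_int), True
    if math.isinf(value):
        return (max_int if value > 0 else min_int), True
    truncated = int(value)  # int() truncates toward zero
    if truncated > max_int:
        return max_int, True
    if truncated < min_int:
        return min_int, True
    return truncated, False
```

Note that an operand such as -2147483648.5 is in range for the 32-bit case after truncation, so it is not flagged invalid by this model; only arguments at or beyond the limits given in footnote 1 saturate.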




E.1.1.2 Two Infinity Operand Arithmetic 


■ Do not generate exceptions


TABLE E-3 Two Infinity Operations That Do Not Generate Exceptions 





Cases 


+inf plus +inf = +inf 
-inf plus -inf = -inf 


+inf minus -inf = +inf 
-inf minus +inf = -inf 


+inf multiplied by +inf = +inf 
+inf multiplied by -inf = -inf 
-inf multiplied by +inf = -inf 
-inf multiplied by -inf = +inf 





Compares treat infinity just like a regular number: 
+inf = +inf, +inf > anything else; 

-inf = -inf, -inf < anything else. 

Affects following instructions: 

V9 fp compares (rs1 and/or rs2 could be +/- inf): 
* FCMPE 

* FCMP 





■ Could generate exceptions


TABLE E-4 Two Infinity Operations That Generate Exceptions

Cases | Possible Exception | Result (in addition to accrued exception) if tem is cleared

SPARC V9 B.2.2 | IEEE 754 7.1 | (No NaN operand result)
+inf plus -inf = invalid | IEEE 754 invalid |
-inf plus +inf = invalid | IEEE 754 invalid |
+inf minus +inf = invalid | IEEE 754 invalid |
-inf minus -inf = invalid | IEEE 754 invalid |

SPARC V9 B.2.2 | IEEE 754 7.1 |
+inf divided by +inf = invalid | IEEE 754 invalid |
+inf divided by -inf = invalid | IEEE 754 invalid |
-inf divided by +inf = invalid | IEEE 754 invalid |
-inf divided by -inf = invalid | IEEE 754 invalid |
E.1.2 Zero Arithmetic


TABLE E-5 Zero Arithmetic Operations That Generate Exceptions

Cases | Possible Exception | Result (in addition to accrued exception) if tem is cleared

SPARC V9 B.2.2 & 5.1.7.10.4 | IEEE 754 7.1 | (No NaN operand result)
+0 divided by +0 = invalid | IEEE 754 invalid | QNaN
+0 divided by -0 = invalid | IEEE 754 invalid | QNaN
-0 divided by +0 = invalid | IEEE 754 invalid | QNaN
-0 divided by -0 = invalid | IEEE 754 invalid | QNaN

SPARC V9 5.1.7.10.4 | IEEE 754 7.2 |
+num divided by +0 = divide by zero | IEEE 754 div by zero | +inf
+num divided by -0 = divide by zero | IEEE 754 div by zero | -inf
-num divided by +0 = divide by zero | IEEE 754 div by zero | -inf
-num divided by -0 = divide by zero | IEEE 754 div by zero | +inf

SPARC V9 B.2.2 Table 27 (footnote 1): Any arithmetic operation involving zero as 1 operand and a SNaN as the other operand except copying/moving data | IEEE 754 7.1 | (One operand, a SNaN)
(+/- 0) OPERATOR (SNaN2) | IEEE 754 invalid | QSNaN2
(SNaN1) OPERATOR (+/- 0) | IEEE 754 invalid | QSNaN1


1. In this context, 0 is again another exact number, so this exception could also apply to nonzero operands. Also note that the
invalid exception and SNaN to QNaN transformation do not apply to copying/moving data instructions (FMOV, FABS, FNEG).


TABLE E-6 Interesting Zero Arithmetic Sign Result Case 


Cases 





+0 plus -0 = +0 for all round modes except round to -infinity where the 
result is -0. 
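
The sign rules for zero operands can be illustrated with a brief Python sketch. It models the divide-by-zero result with the trap disabled and checks the signed-zero sum on the host, which normally runs in round-to-nearest mode; the helper names are hypothetical, and nothing here executes on UltraSPARC T1.

```python
import math

def is_negative(x):
    """True if the sign bit of x is set, including for -0.0."""
    return math.copysign(1.0, x) < 0

def divide_by_zero_result(num, zero):
    """Result of a nonzero number divided by a signed zero when the
    divide-by-zero trap is disabled: an infinity whose sign is the
    exclusive-or of the operand signs."""
    if zero != 0.0 or num == 0.0 or math.isnan(num):
        raise ValueError("expects a nonzero number divided by a zero")
    negative = is_negative(num) != is_negative(zero)
    return -math.inf if negative else math.inf

# +0 plus -0 under round-to-nearest gives +0.
sum_of_zeros = 0.0 + -0.0
```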


E.1.3 NaN Arithmetic 


■ Do not generate exceptions


TABLE E-7 NaN Arithmetic Operations That Do Not Generate Exceptions


Cases


SPARC V9 B.2.1: Fp convert to wider NaN transformation
FsTOd (QNaN2) = QNaN2 widened
FsTOd(0x7fd10000) = 0x7ffa2000 8'h0
FsTOd(0xffd10000) = 0xfffa2000 8'h0


SPARC V9 B.2.1: Fp convert to narrower NaN transformation
FdTOs (QNaN2) = QNaN2 narrowed
FdTOs(0x7ffa2000 8'h0) = 0x7fd10000
FdTOs(0xfffa2000 8'h0) = 0xffd10000


SPARC V9 B.2.2 Table 27
Any non-compare arithmetic operations -- result takes sign of QNaN pass-through operand.

(+/- num) OPERATOR (QNaN2) = QNaN2
(QNaN1) OPERATOR (+/- num) = QNaN1
(QNaN1) OPERATOR (QNaN2) = QNaN2
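
The widening and narrowing of NaN bit patterns can be reproduced with integer arithmetic. The Python sketch below is a model of the transformation, assuming the fraction is left-justified when widened, truncated when narrowed, and that the most significant fraction bit is the quiet bit; the function and constant names are hypothetical.

```python
QUIET_BIT_SP = 1 << 22   # most significant fraction bit, single precision
QUIET_BIT_DP = 1 << 51   # most significant fraction bit, double precision

def fstod_nan(bits32):
    """Widen a single-precision NaN bit pattern to double precision."""
    sign = bits32 >> 31
    fraction = bits32 & 0x7FFFFF
    if (bits32 >> 23) & 0xFF != 0xFF or fraction == 0:
        raise ValueError("not a single-precision NaN")
    wide = (sign << 63) | (0x7FF << 52) | (fraction << 29)
    return wide | QUIET_BIT_DP   # an SNaN input comes out quieted

def fdtos_nan(bits64):
    """Narrow a double-precision NaN bit pattern to single precision."""
    sign = bits64 >> 63
    fraction = bits64 & ((1 << 52) - 1)
    if (bits64 >> 52) & 0x7FF != 0x7FF or fraction == 0:
        raise ValueError("not a double-precision NaN")
    narrow = (sign << 31) | (0xFF << 23) | (fraction >> 29)
    return narrow | QUIET_BIT_SP   # the low 29 fraction bits are discarded
```

With these definitions, `fstod_nan(0x7fd10000)` gives 0x7ffa2000 followed by eight zero hex digits, and a signaling input such as 0x7f910000 widens to the same quiet pattern.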
■ Could generate exceptions


TABLE E-8 NaN Arithmetic Operations That Could Generate Exceptions

Cases | Possible Exception | Result (in addition to accrued exception) if tem is cleared

SPARC V9 B.2.1: Fp convert to wider NaN transformation | IEEE 754 7.1 |
FsTOd (SNaN2) = QSNaN2 widened | IEEE 754 invalid | QSNaN2 widened
   FsTOd(0x7f910000) = 0x7ffa2000 8'h0
   FsTOd(0xff910000) = 0xfffa2000 8'h0

SPARC V9 B.2.1: Fp convert to narrower NaN transformation | IEEE 754 7.1 |
FdTOs (SNaN2) = QSNaN2 narrowed | IEEE 754 invalid | QSNaN2 narrowed
   FdTOs(0x7ff22000 8'h0) = 0x7fd10000
   FdTOs(0xfff22000 8'h0) = 0xffd10000

SPARC V9 B.2.2 Table 27: Any non-compare arithmetic operations except copying/moving (fmov, fabs, fneg) | IEEE 754 7.1 |
(+/- num) OPERATOR (SNaN2) | IEEE 754 invalid | QSNaN2
(SNaN1) OPERATOR (+/- num) | IEEE 754 invalid | QSNaN1
(SNaN1) OPERATOR (SNaN2) | IEEE 754 invalid | QSNaN2
(QNaN1) OPERATOR (SNaN2) | IEEE 754 invalid | QSNaN2
(SNaN1) OPERATOR (QNaN2) | IEEE 754 invalid | QSNaN1

SPARC V9 Appendix B.5 | IEEE 754 7.1 |
F{s,d}TOi (+QNaN) = invalid | IEEE 754 invalid | 2^31 - 1
F{s,d}TOi (+SNaN) = invalid | IEEE 754 invalid | 2^31 - 1
F{s,d}TOx (+QNaN) = invalid | IEEE 754 invalid | 2^63 - 1
F{s,d}TOx (+SNaN) = invalid | IEEE 754 invalid | 2^63 - 1
F{s,d}TOi (-QNaN) = invalid | IEEE 754 invalid | -2^31
F{s,d}TOi (-SNaN) = invalid | IEEE 754 invalid | -2^31
F{s,d}TOx (-QNaN) = invalid | IEEE 754 invalid | -2^63
F{s,d}TOx (-SNaN) = invalid | IEEE 754 invalid | -2^63





E.1.4 Special Inexact Exceptions 


UltraSPARC T1 follows SPARC V9 5.1.7.10.5 (IEEE 754 Section 7.5) and sets
FSR_inexact whenever the rounded result of an operation differs from the infinitely
precise unrounded result.
Additionally, there are a few special cases to be aware of:


TABLE E-9 Fp <-> Int Conversions With Inexact Exceptions

Cases | Possible Exception | Result (in addition to accrued exception) if tem is cleared

SPARC V9 A.14: Fp convert to 32b integer when source operand lies between -(2^31 - 1) and 2^31, but is not exactly an integer | IEEE 754 7.5 |
FsTOi, FdTOi | IEEE 754 inexact | An integer number

SPARC V9 A.14: Fp convert to 64b integer when source operand lies between -(2^63 - 1) and 2^63, but is not exactly an integer | IEEE 754 7.5 |
FsTOx, FdTOx | IEEE 754 inexact | An integer number

SPARC V9 A.15: Convert integer to fp format when 32b integer source operand magnitude is not exactly representable in single precision (23b mantissa). Note, even if the operand is > 2^24 - 1, if enough of its trailing bits are zeros, it may still be exactly representable. | IEEE 754 7.5 |
FiTOs | IEEE 754 inexact | A SP number

SPARC V9 A.15: Convert integer to fp format when 64b integer source operand magnitude is not exactly representable in single precision (23b mantissa). Note, even if the operand is > 2^24 - 1, if enough of its trailing bits are zeros, it may still be exactly representable. | IEEE 754 7.5 |
FxTOs | IEEE 754 inexact | A SP number

SPARC V9 A.15: Convert integer to fp format when 64b integer source operand magnitude is not exactly representable in double precision (52b mantissa). Note, even if the operand is > 2^53 - 1, if enough of its trailing bits are zeros, it may still be exactly representable. | IEEE 754 7.5 |
FxTOd | IEEE 754 inexact | A DP number
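
The remark about trailing zero bits in the integer-to-floating-point cases can be made concrete with a short Python check. This is a sketch of the representability test only; the function names are hypothetical, and the widths of 24 and 53 bits count the implicit leading 1 in addition to the 23-bit and 52-bit stored mantissas.

```python
def converts_exactly(n, significand_bits):
    """True if integer n is exactly representable with the given significand width.

    significand_bits counts the implicit leading 1: 24 for single, 53 for double.
    """
    magnitude = abs(n)
    if magnitude == 0:
        return True
    # Trailing zero bits are absorbed by the exponent, so strip them first.
    while magnitude % 2 == 0:
        magnitude //= 2
    return magnitude.bit_length() <= significand_bits

def raises_inexact(instruction, n):
    """Whether FiTOs, FxTOs, or FxTOd would flag inexact for integer n."""
    width = {"FiTOs": 24, "FxTOs": 24, "FxTOd": 53}[instruction]
    return not converts_exactly(n, width)
```

For instance, 2^30 converts exactly with FiTOs even though it is far larger than 2^24, whereas 2^24 + 1 does not.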








E.2 Subnormal Handling


The UltraSPARC T1 FPU provides full hardware support for subnormal operands
and results. Unlike UltraSPARC I/II and UltraSPARC III, UltraSPARC T1 will never
generate an unfinished FPop trap type.
APPENDIX F


Caches and Cache Coherency 


Chapter Revision History 














Date By Comment 

28 Aug 03 Started outline of appendix. 

22 Sep 03 Copied and reformatted Chapter 8, Cache and Memory Interactions, 
from 115111 User Manual. 

25 Nov 03 Initial changes for UltraSPARC T1 

4 Jun 04 More information on unit of coherence. 

2 Aug 04 Add cache index information. 

20 Dec 04 Better documentation of how to flush L2. 

11 Feb 05 M. L. Nohr Converted to UltraSPARC Architecture format 











This appendix describes various interactions between the caches and memory, and 
the management processes that an operating system must perform to maintain data 
integrity in these cases. In particular, it discusses the following subjects: 


Invalidation of one or more cache entries - when and how to do it 
Differences between cacheable and noncacheable accesses 
Ordering and synchronization of memory accesses 

Accesses to addresses that cause side effects (I/O accesses) 
Nonfaulting loads 

Cache sizes, associativity, replacement policy, etc. 





F.1 Cache Flushing 


Data in the level-1 (read-only or write-through) caches can be flushed by 
invalidating the entry in the cache. Modified data in the level-2 (writeback) cache 
must be written back to memory when flushed. 


Cache flushing is required in the following cases: 


■ I-cache: Flush is needed before executing code that is modified by a local store
instruction other than block commit store; see Section 3.1.1.1, "Instruction Cache
(I-cache)." This is done with the FLUSH instruction or by using ASI accesses.
When ASI accesses are used, software must ensure that the flush is done on the
same virtual core as the stores that modified the code space.


■ D-cache: Flush is needed when a physical page is changed from (virtually)
cacheable to (virtually) noncacheable. This is done with a displacement flush
(see Displacement Flushing on page 116).


■ L2 cache: Flush is needed for stable storage. Examples of stable storage include
battery-backed memory and transaction logs. This is done with a displacement
flush (see Displacement Flushing on page 116). Flushing the L2 cache flushes the
corresponding blocks from the I- and D-caches because UltraSPARC T1 maintains
inclusion between the L2 and L1 caches.


Displacement Flushing 


Cache flushing can be accomplished by a displacement flush. This is done by placing
the cache in direct-map mode and reading a range of read-only addresses that map
to the corresponding cache line being flushed, forcing out modified entries in the
local cache. Care must be taken to ensure that the range of read-only addresses is
mapped in the MMU before starting a displacement flush; otherwise, the TLB miss
handler may put new data into the caches. In addition, the range of addresses used
to force lines out of the cache must not be present in the cache when starting the
displacement flush. If any of the displacing lines are present before the
displacement flush starts, fetching the already-present line does not cause the
proper way in the direct-mapped mode L2 to be loaded; instead, the already-present
line stays at its current location in the cache.


Note | Diagnostic ASI accesses to the L2 cache can be used to invalidate 
a line, but they are generally not an alternative to displacement 
flushing. Modified data in the L2 cache will not be written back 
to memory using these ASI accesses. 


Memory Accesses and Cacheability 


Note | Atomic load-store instructions are treated as both a load and a 
store; they can be performed only in cacheable address spaces. 
Coherence Domains


Two types of memory operations are supported in UltraSPARC T1: cacheable and
noncacheable accesses, as indicated by the page translation. Cacheable accesses are
inside the coherence domain; noncacheable accesses are outside the coherence
domain.


SPARC V9 does not specify memory ordering between cacheable and noncacheable
accesses. In TSO mode, UltraSPARC T1 maintains TSO ordering, regardless of the
cacheability of the accesses. For SPARC V9 compatibility while in PSO or RMO
mode, a MEMBAR #Lookaside should be used between a store and a subsequent
load to the same noncacheable address. See the SPARC Architecture Manual, Version 9
for more information about the SPARC V9 memory models.


On UltraSPARC T1, a MEMBAR #Lookaside executes more efficiently than a 
MEMBAR #StoreLoad. 


F.1.3.1 Cacheable Accesses 


Accesses that fall within the coherence domain are called cacheable accesses. They
are implemented in UltraSPARC T1 with the following properties:


■ Data resides in real memory locations.

■ They observe the supported cache coherence protocol.

■ The unit of coherence is 64 bytes at the system level (coherence between the
virtual processors and I/O), enforced by the L2 cache.

■ The unit of coherence for the primary caches (coherence between multiple virtual
processors) is the primary cache line size (16 bytes for the data cache, 32 bytes for
the instruction cache), enforced by the L2 cache directories.


F.1.3.2  Noncacheable and Side-Effect Accesses 


Accesses that are outside the coherence domain are called noncacheable accesses. 
Accesses of some of these memory (or memory mapped) locations may result in side 
effects. Noncacheable accesses are implemented in UltraSPARC T1 with the 
following properties: 


■ Data may or may not reside in real memory locations.

■ Accesses may result in program-visible side effects; for example, memory-mapped
I/O control registers in a UART may change state when read.

■ Accesses may not observe the supported cache coherence protocol.

■ The smallest unit in each transaction is a single byte.


Noncacheable accesses with the e bit set (that is, those having side-effects) are all 
strongly ordered with respect to other noncacheable accesses with the e bit set. 
Speculative loads with the e bit set cause a data access exception trap. 


Note | The side-effect attribute does not imply noncacheability. 


F.1.3.3 Global Visibility and Memory Ordering 


To ensure the correct ordering between the cacheable and noncacheable domains, 
explicit memory synchronization is needed in the form of MEMBARs or atomic 
instructions. CODE EXAMPLE F-1 illustrates the issues involved in mixing cacheable 
and noncacheable accesses. 


CODE EXAMPLE F-1 Memory Ordering and MEMBAR Examples 


Assume that all accesses go to non-side-effect memory locations.

Process A:
While (1)
{
     Store D1:data produced
1    MEMBAR #StoreStore (needed in PSO, RMO)
     Store F1:set flag
     While F1 is set (spin on flag)
          Load F1
2    MEMBAR #LoadLoad | #LoadStore (needed in RMO)
     Load D2
}

Process B:
While (1)
{
     While F1 is cleared (spin on flag)
          Load F1
2    MEMBAR #LoadLoad | #LoadStore (needed in RMO)
     Load D1
     Store D2
1    MEMBAR #StoreStore (needed in PSO, RMO)
     Store F1:clear flag
}
Note | A MEMBAR #MemIssue or MEMBAR #Sync is needed if
ordering of cacheable accesses following noncacheable accesses must
be maintained in PSO or RMO.


In TSO mode, loads and stores (except block stores) cannot pass earlier loads, and 
stores cannot pass earlier stores; therefore, no MEMBAR is needed. 


In PSO mode, loads are completed in program order, but stores are allowed to pass
earlier stores; therefore, only the MEMBAR at #1 is needed between updating data
and the flag.


In RMO mode, there is no implicit ordering between memory accesses; therefore, the 
MEMBARs at both #1 and #2 are needed. 
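
The three cases above reduce to a small lookup, sketched here in Python as a reading aid. The function and constant names are hypothetical; the barrier labels are the ones marked 1 and 2 in CODE EXAMPLE F-1.

```python
# Barrier labels as they appear at markers 1 and 2 in CODE EXAMPLE F-1.
BARRIER_1 = "MEMBAR #StoreStore"
BARRIER_2 = "MEMBAR #LoadLoad | #LoadStore"

def required_barriers(memory_model):
    """Return the barriers from CODE EXAMPLE F-1 that are needed under a memory model."""
    required = {
        "TSO": [],                       # loads and stores already stay ordered
        "PSO": [BARRIER_1],              # stores may pass earlier stores
        "RMO": [BARRIER_1, BARRIER_2],   # no implicit ordering at all
    }
    return required[memory_model]
```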


Memory Synchronization: MEMBAR and FLUSH 


The MEMBAR (STBAR in SPARC V8) and FLUSH instructions are provided for
explicit control of memory ordering in program execution. MEMBAR has several
variations; their implementations in UltraSPARC T1 are described below.


■ MEMBAR #LoadLoad: Forces all loads after the MEMBAR to wait until all
loads before the MEMBAR have reached global visibility.

■ MEMBAR #StoreLoad: Forces all loads after the MEMBAR to wait until all
stores before the MEMBAR have reached global visibility.

■ MEMBAR #LoadStore: Forces all stores after the MEMBAR to wait until all
loads before the MEMBAR have reached global visibility.

■ MEMBAR #StoreStore and STBAR: Forces all stores after the MEMBAR to
wait until all stores before the MEMBAR have reached global visibility.


Notes | (1) STBAR has the same semantics as MEMBAR #StoreStore; it
is included for SPARC V8 compatibility.

(2) The above four MEMBARs do not guarantee ordering between
cacheable accesses after noncacheable accesses.


■ MEMBAR #Lookaside: SPARC V9 provides this variation for
implementations having virtually tagged store buffers that do not contain
information for snooping.


Note | For SPARC V9 compatibility, this variation should be used before
issuing a load to an address space that cannot be snooped.


■ MEMBAR #MemIssue: Forces all outstanding memory accesses to be completed
before any memory access instruction after the MEMBAR is issued. It must be
used to guarantee ordering of cacheable accesses following noncacheable
accesses. For example, I/O accesses must be followed by a MEMBAR #MemIssue
before subsequent cacheable stores; this ensures that the I/O accesses reach global
visibility before the cacheable stores after the MEMBAR.

MEMBAR #MemIssue is different from the combination of MEMBAR
#LoadLoad | #LoadStore | #StoreLoad | #StoreStore. MEMBAR
#MemIssue orders cacheable and noncacheable domains; it prevents memory
accesses after it from issuing until it completes.


■ MEMBAR #Sync (Issue Barrier): Forces all outstanding instructions and all
deferred errors to be completed before any instructions after the MEMBAR are
issued.


Note | MEMBAR #Sync is a costly instruction; unnecessary usage may 
result in substantial performance degradation. 


See the references to "Memory Barrier," "The MEMBAR Instruction," and 
"Programming With the Memory Models," in The SPARC Architecture Manual, 
Version 9 for more information. 
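
As an illustration of how the MEMBAR variations combine, the Python sketch below assembles the instruction word from the option names. The bit assignments (mmask in bits 3:0, cmask in bits 6:4, and the fixed fields op = 2, rd = 0, op3 = 0x28, rs1 = 15, i = 1) are taken from the SPARC V9 MEMBAR instruction definition and are not stated in this appendix; the function and table names are hypothetical.

```python
# Ordering constraints (mmask, bits 3:0) and completion constraints (cmask, bits 6:4).
MMASK = {"#LoadLoad": 0x01, "#StoreLoad": 0x02, "#LoadStore": 0x04, "#StoreStore": 0x08}
CMASK = {"#Lookaside": 0x10, "#MemIssue": 0x20, "#Sync": 0x40}

# Fixed fields of the MEMBAR encoding: op = 2, rd = 0, op3 = 0x28, rs1 = 15, i = 1.
MEMBAR_BASE = (2 << 30) | (0x28 << 19) | (15 << 14) | (1 << 13)

def encode_membar(*options):
    """Return the 32-bit instruction word for MEMBAR with the given options OR-ed together."""
    mask = 0
    for option in options:
        if option in MMASK:
            mask |= MMASK[option]
        elif option in CMASK:
            mask |= CMASK[option]
        else:
            raise ValueError("unknown MEMBAR option: " + option)
    return MEMBAR_BASE | mask
```

For example, `encode_membar("#LoadLoad", "#LoadStore")` corresponds to the barrier marked 2 in CODE EXAMPLE F-1.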


F.1.4.1 Self-Modifying Code (FLUSH)


The SPARC V9 instruction set architecture does not guarantee consistency between 
code and data spaces. A problem arises when code space is dynamically modified by 
a program writing to memory locations containing instructions. LISP programs and 
dynamic linking require this behavior. SPARC V9 provides the FLUSH instruction to 
synchronize instruction and data memory after code space has been modified. 


In UltraSPARC T1, a FLUSH behaves like a store instruction for the purpose of 
memory ordering. In addition, all instruction fetch (or prefetch) buffers are 
invalidated. The issue of the FLUSH instruction is delayed until previous (cacheable) 
stores are completed. Instruction fetch (or prefetch) resumes at the instruction 
immediately after the FLUSH. 


Atomic Operations 


SPARC V9 provides three atomic instructions to support mutual exclusion. These 
instructions behave like both a load and a store but the operations are carried out 
indivisibly. Atomic instructions may be used only in the cacheable domain. 


An atomic access with a restricted ASI in nonprivileged mode (PSTATE.priv = 0) 
causes a privileged action trap. An atomic access with a noncacheable address 
causes a data access exception trap. An atomic access with an unsupported ASI 
causes a data access exception trap. TABLE F-1 lists the ASIs that support atomic 
accesses. 
TABLE F-1 ASIs That Support SWAP, LDSTUB, and CAS

ASI Name | Access

ASI_NUCLEUS{_LITTLE} | Restricted
ASI_AS_IF_USER_PRIMARY{_LITTLE} | Restricted
ASI_AS_IF_USER_SECONDARY{_LITTLE} | Restricted
ASI_PRIMARY{_LITTLE} | Unrestricted
ASI_SECONDARY{_LITTLE} | Unrestricted
ASI_REAL{_LITTLE} | Unrestricted








Note | Atomic accesses with nonfaulting ASIs are not allowed, because these 
ASIs have the load-only attribute. 


F.1.5.1 SWAP Instruction 


SWAP atomically exchanges the lower 32 bits in an integer register with a word in 
memory. This instruction is issued only after store buffers are empty. Subsequent 
loads interlock on earlier SWAPs. A cache miss allocates the corresponding line. 


Note | If a page is marked as virtually noncacheable but physically 
cacheable, allocation is done to the L2 cache only. 


F.1.5.2 LDSTUB Instruction


LDSTUB behaves like SWAP, except that it loads a byte from memory into an integer
register and atomically writes all ones (0xFF) into the addressed byte.


F.1.5.3 Compare and Swap (CASX) Instruction 


Compare-and-swap combines a load, compare, and store into a single atomic 
instruction. It compares the value in an integer register to a value in memory; if they 
are equal, the value in memory is swapped with the contents of a second integer 
register. All of these operations are carried out atomically; in other words, no other 
memory operation may be applied to the addressed memory location until the entire 
compare-and-swap sequence is completed. 
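
The semantics of the three atomic instructions can be modeled in a few lines of Python. This is a toy model for illustration only: the `Memory` class and its method names are hypothetical, a lock stands in for the hardware's indivisibility, memory is treated as big-endian, and returning the old memory value from compare-and-swap in every case follows the SPARC V9 definition rather than anything restated in this section.

```python
import threading

class Memory:
    """Toy big-endian, byte-addressed memory; a lock stands in for hardware atomicity."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.lock = threading.Lock()

    def swap(self, addr, reg):
        """SWAP: exchange the low 32 bits of reg with the word at addr; return the old word."""
        with self.lock:
            old = int.from_bytes(self.data[addr:addr + 4], "big")
            self.data[addr:addr + 4] = (reg & 0xFFFFFFFF).to_bytes(4, "big")
            return old

    def ldstub(self, addr):
        """LDSTUB: return the byte at addr and write all ones (0xFF) into it."""
        with self.lock:
            old = self.data[addr]
            self.data[addr] = 0xFF
            return old

    def casx(self, addr, compare, new):
        """CASX: if the doubleword at addr equals compare, store new; return the old value."""
        with self.lock:
            old = int.from_bytes(self.data[addr:addr + 8], "big")
            if old == compare:
                self.data[addr:addr + 8] = (new & (2**64 - 1)).to_bytes(8, "big")
            return old
```

A simple lock acquire can be built on `ldstub`: the caller that reads back 0 owns the lock, and any caller that reads back 0xFF must retry.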


Nonfaulting Load 


A nonfaulting load behaves like a normal load, except as follows: 
■ It does not allow side-effect access. An access with the e bit set causes a
data access exception trap.


■ It can be applied to a page with the nfo bit set; other types of accesses will cause
a data access exception trap.
APPENDIX G


Glossary 


This chapter defines concepts and terminology common to all implementations of 
SPARC V9. It also includes terms that are unique to the UltraSPARC T1 
implementation. 


ALU
   Arithmetic Logical Unit

address space identifier (ASI)
   An 8-bit value that identifies an address space. For each instruction or data
   access, the integer unit appends an ASI to the address. See also implicit ASI.

application program
   A program executed with the processor in nonprivileged mode. Note:
   Statements made in this specification regarding application programs may not
   be applicable to programs (for example, debuggers) that have access to
   privileged processor state (for example, as stored in a memory-image dump).

architectural state
   Software-visible registers and memory (including caches).

ARF
   Architectural Register File.

ASI
   Address Space Identifier.

ASR
   Ancillary State Register.

big-endian
   An addressing convention. Within a multiple-byte integer, the byte with the
   smallest address is the most significant; a byte's significance decreases as its
   address increases.

BLD
   (Obsolete) abbreviation for Block Load instruction; replaced by LDBLOCKF.

blocking ASI
   An ASI that will access its ASI register/array location once all older
   instructions in that strand have retired, there are no instructions in the other
   strand which can issue, and the store queue, TSW, and LMB are all empty.
   Additionally, the snoop pipeline is stalled before accessing the ASI register/
   array location. See nonblocking ASI.

branch outcome
   Refers to whether or not a branch instruction will alter the flow of execution
   from the sequential path. A taken branch outcome results in execution
   proceeding with the instruction at the branch target; a not-taken branch
   outcome results in execution proceeding with the instruction along the
   sequential path after the branch.
branch resolution
   A branch is said to be resolved when the result (that is, the branch outcome
   and branch target address) has been computed and is known for certain.
   Branch resolution can take place late in the pipeline.

branch target address
   The address of the instruction to be executed if the branch is taken.

BST
   (Obsolete) abbreviation for Block Store instruction; replaced by STBLOCKF.

bypass ASI
   An ASI that refers to memory and for which the MMU does not perform
   virtual-to-real address translation (that is, memory is accessed using a direct
   real address).

byte
   Eight consecutive bits of data.

CCR
   Condition Codes register

clean window
   A register window in which all of the registers contain 0, a valid address from
   the current address space, or valid data from the current address space.

CMP
   Chip multiprocessor. A single-chip processor that contains more than one
   virtual processor. See also processor and virtual processor.

coherence
   A set of protocols guaranteeing that all memory accesses are globally visible to
   all caches in a shared-memory system.

commit
   An instruction commits when it modifies architectural state.

completed
   A memory transaction is said to be completed when an idealized memory has
   executed the transaction with respect to all processors. A load is considered
   completed when no subsequent memory transaction can affect the value
   returned by the load. A store is considered completed when no subsequent
   load can return the value that was overwritten by the store.

complex instruction
   An instruction that requires the creation of secondary "helper" instructions for
   normal operation, excluding trap conditions such as spill/fill traps (which use
   helpers).

consistency
   See coherence.

context
   A set of translations that supports a particular address space. See also Memory
   Management Unit (MMU).

CWP
   Current Window Pointer

CPI
   Cycles per instruction. The number of clock cycles it takes to execute an
   instruction.

CPU
   Central Processing Unit. A synonym for virtual processor.

cross-call
   An interprocessor call in a multiprocessor system.

CSR
   Control Status Register.

CTI
   Control transfer instruction
current window
    The block of 24 R registers that is currently in use. The Current Window
    Pointer (CWP) register points to the current window.

DCTI
    Delayed control transfer instruction.

DDR
    Double Data Rate.

deprecated
    The term applied to an architectural feature (such as an instruction or
    register) for which a SPARC V9 implementation provides support only for
    compatibility with previous versions of the architecture. Use of a
    deprecated feature must generate correct results but may compromise
    software performance. Deprecated features should not be used in new
    SPARC V9 software and may not be supported in future versions of the
    architecture.

DFT
    Designed for test.

doublet
    Two bytes (16 bits) of data.

doubleword
    An aligned octlet. Note: The definition of this term is architecture
    dependent and may differ from that used in other processor architectures.

DTLB
    Data Cache Translation lookaside buffer.

even parity
    The mode of parity checking in which each combination of data bits plus a
    parity bit contains an even number of set bits.

exception
    A condition that makes it impossible for the processor to continue
    executing the current instruction stream without software intervention.
    See also trap.

extended word
    An aligned octlet, nominally containing integer data. Note: The definition
    of this term is architecture dependent and may differ from that used in
    other processor architectures.

EXU
    Execution Unit.

F register
    A floating-point register. SPARC V9 includes single-, double-, and
    quad-precision F registers.

fccn
    One of the floating-point condition code fields fcc0, fcc1, fcc2, or fcc3.

FGU
    Floating-point and Graphics Unit.

floating-point exception
    An exception that occurs during the execution of an FPop instruction as
    defined by the FPop1, FPop2, IMPDEP1, and IMPDEP2 opcodes. The exceptions
    are unfinished_FPop, unimplemented_FPop, sequence_error, hardware_error,
    invalid_fp_register, or IEEE_754_exception.

floating-point IEEE-754 exception
    A floating-point exception, as specified by IEEE Std 754-1985. Listed
    within this specification as IEEE_754_exception.

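As an illustration of the even parity entry above (and of odd parity, defined later in this glossary), the following Python sketch computes the parity bit for a data word. The function names and the example values are assumptions made for illustration only; they do not describe how UltraSPARC T1 hardware generates or checks parity.

```python
def parity_bit(data: int, even: bool = True) -> int:
    """Return the parity bit to store alongside `data`.

    With even=True, the data bits plus the returned bit contain an even
    number of set bits; with even=False, they contain an odd number.
    """
    if data < 0:
        raise ValueError("data must be a non-negative integer")
    ones = bin(data).count("1")   # number of set bits in the data word
    bit_for_even = ones & 1       # 1 only when the count of set bits is odd
    return bit_for_even if even else bit_for_even ^ 1


def has_even_parity(data: int, parity: int) -> bool:
    """Check that the data bits plus the parity bit hold an even number of 1s."""
    return (bin(data).count("1") + parity) % 2 == 0
```

For example, the data word 0b1011 has three set bits, so its even-parity bit is 1 and its odd-parity bit is 0.
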
floating-point operate (FPop) instructions
    Instructions that perform floating-point and graphics calculations, as
    defined by the FPop1, FPop2, and IMPDEP1 opcodes. FPop instructions do not
    include FBfcc instructions, loads and stores between memory and the
    floating-point unit, or instructions defined by the IMPDEP2 opcodes.

floating-point trap type
    The specific type of a floating-point exception, encoded in the FSR.ftt
    field.

floating-point unit
    A processing unit that contains the floating-point registers and performs
    floating-point operations, as defined by this specification.

FP
    Floating Point.

FPRS
    Floating Point Register State (register).

FRF
    Floating-point register file.

FSR
    Floating-Point Status register.

GL
    Global-Level register.

GSR
    General Status register.

halfword
    An aligned doublet. Note: The definition of this term is architecture
    dependent and may differ from that used in other processor architectures.

helper
    An instruction generated by the IRU in response to a complex instruction.
    Helper instructions are not visible to software. Refer to Instruction
    Latencies on page 76 for a complete list of all complex instructions and
    their helper sequences.

hexlet
    Sixteen bytes (128 bits) of data.

hyperprivileged
    An adjective that describes the state of the processor when it is
    executing in hyperprivileged mode.

IFU
    Instruction Fetch Unit.

implementation
    Hardware or software that conforms to all of the specifications of an
    instruction set architecture (ISA).

implementation dependent
    An aspect of the architecture that can legitimately vary among
    implementations. In many cases, the permitted range of variation is
    specified in the SPARC V9 standard. When a range is specified, compliant
    implementations must not deviate from that range.

implicit ASI
    The address space identifier that is supplied by the hardware on all
    instruction accesses and on data accesses that do not contain an explicit
    ASI or a reference to the contents of the ASI register.

informative appendix
    An appendix containing information that is useful but not required to
    create an implementation that conforms to the SPARC V9 specification. See
    also normative appendix.

initiated
    Synonym: issued.

instruction field
    A bit field within an instruction word.

instruction set architecture
    A set that defines instructions, registers, instruction and data memory,
    the effect of executed instructions on the registers and memory, and an
    algorithm for controlling instruction execution. Does not define clock
    cycle times, cycles per instruction, data paths, etc. The bulk of the ISA
    implemented by UltraSPARC T1 is defined in the UltraSPARC Architecture
    2005; a few extensions are described in this document.

integer unit (IU)
    A processing unit that performs integer and control-flow operations and
    contains general-purpose integer registers and processor state registers,
    as defined by this specification.

interrupt request
    A request for service presented to the processor by an external device.

IRF
    Integer register file.

IRU

ISA
    Instruction set architecture.

issue
    Used to describe the act of conveying an instruction from the instruction
    fetch unit for execution on the pipeline.

L2C
    Level 2 Cache.

leaf procedure
    A procedure that is a leaf in the program's call graph; that is, one that
    does not call (by using CALL or JMPL) any other procedures.

little-endian
    An addressing convention. Within a multiple-byte integer, the byte with
    the smallest address is the least significant; a byte's significance
    increases as its address increases.

load
    An instruction that reads (but does not write) memory or reads (but does
    not write) location(s) in an alternate address space. Load includes loads
    into integer or floating-point registers, block loads, Load Quadword
    Atomic, and alternate address space variants of those instructions. See
    also load-store and store, the definitions of which are mutually exclusive
    with load.

load-store
    An instruction that explicitly both reads and writes memory or explicitly
    reads and writes location(s) in an alternate address space. Load-store
    includes instructions such as CASA, CASXA, LDSTUB, LDSTUBA, and the
    deprecated SWAP and SWAPA instructions. See also load and store, the
    definitions of which are mutually exclusive with load-store.

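The little-endian convention defined above can be made concrete with a short Python sketch that lists which byte of a multiple-byte integer lands at each address offset. The helper name byte_layout and the sample value 0x0A0B0C0D are assumptions chosen for illustration; the sketch uses only Python's built-in int.to_bytes and does not model any UltraSPARC T1 ASI.

```python
def byte_layout(value, size, byteorder):
    """Return (address offset, byte value) pairs for a multiple-byte integer.

    `byteorder` is "little" or "big", as accepted by int.to_bytes.
    """
    raw = value.to_bytes(size, byteorder=byteorder)
    return list(enumerate(raw))


# Hypothetical 4-byte value used only to show the two layouts.
little = byte_layout(0x0A0B0C0D, 4, "little")
big = byte_layout(0x0A0B0C0D, 4, "big")

for offset, byte in little:
    print(f"little-endian: offset {offset} holds 0x{byte:02X}")
```

In the little-endian layout, offset 0 holds 0x0D (the least significant byte) and significance increases with the address; in the big-endian layout, offset 0 holds 0x0A.
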
may
    A keyword indicating flexibility of choice with no implied preference.
    Note: "May" indicates that an action or operation is allowed; "can"
    indicates that it is possible.

Memory Management Unit (MMU)
    The address translation hardware that translates 64-bit virtual addresses
    into real addresses. See also context, physical address, and virtual
    address.

must
    Synonym: shall.

next program counter (NPC)
    A register that contains the address of the instruction to be executed
    next if a trap does not occur.

NFO
    Nonfault access only.

nonblocking ASI
    An ASI that will access its ASI register/array location once all older
    instructions in that strand have retired and there are no instructions in
    the other strand that can issue. See blocking ASI.

nonfaulting load
    A load operation that, in the absence of faults or in the presence of a
    recoverable fault, completes correctly, and in the presence of a
    nonrecoverable fault returns (with the assistance of system software) a
    known data value (nominally zero). See speculative load.

nonprivileged
    An adjective that describes:

    (1) the state of the processor when PSTATE.priv = 0, that is,
    nonprivileged mode;

    (2) processor state information that is accessible to software while the
    processor is in either privileged mode or nonprivileged mode; for example,
    nonprivileged registers, nonprivileged ASRs, or, in general, nonprivileged
    state;

    (3) an instruction that can be executed when the processor is in either
    privileged mode or nonprivileged mode.

nonprivileged mode
    The mode in which a processor is operating when PSTATE.priv = 0. See also
    privileged.

normative appendix
    An appendix containing specifications that must be met by an
    implementation conforming to the SPARC V9 specification. See also
    informative appendix.

nontranslating ASI
    An ASI that does not refer to memory (for example, refers to
    control/status register(s)) and for which the MMU does not perform address
    translation.

NPC
    Next program counter.

npt
    Nonprivileged trap.

N_REG_WINDOWS
    The number of register windows present in a particular implementation.

octlet
    Eight bytes (64 bits) of data. Not to be confused with "octet," which has
    been commonly used to describe eight bits of data. In this document, the
    term byte, rather than octet, is used to describe eight bits of data.

odd parity
    The mode of parity checking in which each combination of data bits plus a
    parity bit contains an odd number of set bits.

older instruction
    Refers to the relative fetch order of instructions. Instruction i is older
    than instruction j if instruction i was fetched before instruction j. Data
    dependencies flow from older instructions to younger instructions, and an
    instruction can only be dependent upon older instructions. See younger
    instruction.

one-hot
    An n-bit binary signal is one-hot if, and only if, n - 1 of the bits are
    each 0 and a single bit is 1.

opcode
    A bit pattern that identifies a particular instruction.

optional
    A feature not required for compliance to an architecture specification
    (such as UltraSPARC Architecture 2005 or SPARC V9).

PC
    Program counter.

PCR
    Performance Control Register.

PIC
    Performance Instrumentation Counter.

PIL
    Processor Interrupt Level.

prefetchable
    (1) An attribute of a memory location that indicates to an MMU that
    PREFETCH operations to that location may be applied.

    (2) A memory location condition for which the system designer has
    determined that no undesirable effects will occur if a PREFETCH operation
    to that location is allowed to succeed. Typically, normal memory is
    prefetchable.

    Nonprefetchable locations include those that, when read, change state or
    cause external events to occur. For example, some I/O devices are designed
    with registers that clear on read; others have registers that initiate
    operations when read. See side effect.

privileged
    An adjective that describes:

    (1) the state of the processor when it is executing in privileged mode
    (PSTATE.priv = 1);

    (2) processor state that is only accessible to software while the
    processor is in privileged mode; for example, privileged registers,
    privileged ASRs, privileged ASIs, or in general, privileged state;

    (3) an instruction that can be executed only when the processor is in
    privileged mode.

privileged mode
    The mode in which a processor is operating when PSTATE.priv = 1. See also
    nonprivileged.

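The one-hot condition defined above (exactly one bit of an n-bit signal is 1) can be tested with a standard bit trick. The sketch below is illustrative only; the function name is_one_hot and its width argument are assumptions, not part of any UltraSPARC T1 interface.

```python
def is_one_hot(value: int, width: int) -> bool:
    """Return True if exactly one of the low `width` bits of `value` is 1."""
    if width <= 0:
        raise ValueError("width must be positive")
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    # Clearing the lowest set bit (value & (value - 1)) leaves zero
    # only when value had a single set bit to begin with.
    return value != 0 and (value & (value - 1)) == 0
```

For a 4-bit signal, 0b0100 is one-hot, whereas 0b0110 (two bits set) and 0b0000 (no bit set) are not.
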
processor
    The unit on which a shared interface is provided to control the
    configuration and execution of a collection of strands. A processor
    contains one or more physical cores, each of which contains one or more
    strands. In physical terms, a processor is a physical module that plugs
    into a system. A processor is expected to appear logically as a single
    agent on the system interconnect fabric. Therefore, a simple processor,
    like an UltraSPARC I processor, that can only execute one thread at a time
    would be a processor with a single physical core that is single-stranded.
    A processor that follows the academic model of simultaneous multithreading
    (SMT) would be a processor with a single physical core, where that
    physical core supports multiple strands in order to execute multiple
    threads at the same time (multi-stranded physical core). A processor that
    follows the academic model of a CMP would be a processor with multiple
    physical cores, each only supporting a single strand.

    One can also have multiple physical cores where each physical core is
    multi-stranded. UltraSPARC T1 is an example of the latter, where each
    UltraSPARC T1 processor contains eight physical cores, each of which
    contains four strands.

program counter (PC)
    A register that contains the address of the instruction currently being
    executed by the IU.

PSO
    Partial store order.

quadlet
    Four bytes (32 bits) of data.

quadword
    Aligned hexlet. Note: The definition of this term is architecture
    dependent and may be different from that used in other processor
    architectures.

RA
    Real address.

RAS
    Return Address Stack; also Reliability, Availability and Serviceability.

RAW
    Read After Write.

R register
    An integer register. Also called a general-purpose register or working
    register.

rd
    Rounding direction.

RDPR
    Read Privileged Register instruction.

real address
    An address used by privileged mode code to describe the underlying
    physical memory. Real addresses are usually translated by a combination of
    hyperprivileged hardware and software to physical addresses, which can be
    used to access real physical memory or I/O device space.

reserved
    Describing an instruction field, certain bit combinations within an
    instruction field, or a register field that is reserved for definition by
    future versions of the architecture.

    Reserved instruction fields shall read as 0, unless the implementation
    supports extended instructions within the field. The behavior of SPARC V9
    processors when they encounter nonzero values in reserved instruction
    fields is undefined.

    Reserved bit combinations within instruction fields are defined in
    Chapter 5, Instruction Definitions. In all cases, SPARC V9 processors
    shall decode and trap on these reserved combinations.

    Reserved register fields should always be written by software with values
    of those fields previously read from that register or with zeroes; they
    should read as zero in hardware. Software intended to run on future
    versions of SPARC V9 should not assume that these fields will read as 0 or
    any other particular value. Throughout this specification, figures and
    tables illustrating registers and instruction encodings indicate reserved
    fields and combinations with an em dash (—).

restricted
    Describing an address space identifier (ASI) that may be accessed only
    while the processor is operating in privileged mode.

rs1, rs2, rd
    The integer or floating-point register operands of an instruction. rs1 and
    rs2 are the source registers; rd is the destination register.

RMO
    Relaxed memory order.

RTO
    Read to own (cache line state).

RTS
    Read to share (cache line state).

shall
    A keyword indicating a mandatory requirement. Designers shall implement
    all such mandatory requirements to ensure interoperability with other
    SPARC V9-compliant products. Synonym: must.

should
    A keyword indicating flexibility of choice with a strongly preferred
    implementation. Synonym: it is recommended.

SIAM
    Set interval arithmetic mode instruction.

side effect
    The result of a memory location having additional actions beyond the
    reading or writing of data. A side effect can occur when a memory
    operation on that location is allowed to succeed. Locations with side
    effects include those that, when accessed, change state or cause external
    events to occur. For example, some I/O devices contain registers that
    clear on read; others have registers that initiate operations when read.
    See also prefetchable.

SIMD
    Single instruction stream, multiple data stream.

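The rule in the reserved entry above, that software should write reserved register fields with the values previously read from the register, amounts to a read-modify-write update. The following Python sketch models it on plain integers. The function name write_field, the mask, and the register value are hypothetical and do not correspond to any actual UltraSPARC T1 register layout.

```python
def write_field(old_value: int, field_mask: int, field_value: int) -> int:
    """Return a register image with only the masked field changed.

    All bits outside `field_mask`, including reserved bits, keep the
    values that were previously read in `old_value`.
    """
    if field_mask <= 0:
        raise ValueError("field_mask must select at least one bit")
    # Bit position of the least significant bit of the field.
    shift = (field_mask & -field_mask).bit_length() - 1
    shifted = field_value << shift
    if field_value < 0 or shifted & ~field_mask:
        raise ValueError("field_value does not fit in the field")
    return (old_value & ~field_mask) | shifted


# Hypothetical example: update a 4-bit field at bits 7:4, leave the rest alone.
new_value = write_field(0xABCD, 0x00F0, 0x3)
```

Here new_value is 0xAB3D: only bits 7:4 changed, and every other bit carries over from the value that was read.
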
store
    An instruction that writes (but does not explicitly read) memory or writes
    (but does not explicitly read) location(s) in an alternate address space.
    Store includes stores from either integer or floating-point registers,
    block stores, partial store, and alternate address space variants of those
    instructions. See also load and load-store, the definitions of which are
    mutually exclusive with store.

strand
    A term for thread-specific hardware support that identifies the hardware
    state used to hold a software thread in order to execute it. Strand is
    specifically the software-visible architected state (PC, NPC,
    general-purpose registers, floating-point registers, condition codes,
    status registers, ASRs, etc.) of a thread and any microarchitecture state
    required by hardware for its execution. "Strand" replaces the ambiguous
    term "hardware thread". The number of strands in a processor defines the
    number of threads that the operating system can schedule on that processor
    at any given time. See also thread and virtual processor.

strand identifier (SID)
    An n-bit value, in a processor implementing 2^n strands, that uniquely
    identifies each strand. The strand identifier in UltraSPARC T1 is five
    bits wide.

superscalar
    An implementation that allows several instructions to be issued, executed,
    and committed in one clock cycle.

supervisor software
    Software that executes when the processor is in privileged mode.

thread
    An executing process or lightweight process (LWP). Historically, the term
    thread is overused and ambiguous. Software and hardware have historically
    used it differently. From software's (operating system) perspective, the
    term thread refers to an entity that can be run on hardware; it is
    something that is scheduled and may or may not be actively running on
    hardware at any given time, and may migrate around the hardware of a
    system. From hardware's perspective, the term multithreaded processor
    refers to a processor that can run multiple software threads
    simultaneously. To avoid confusion, the term thread is used exclusively in
    the manner in which it is used by software and, specifically, the
    operating system. A thread can be viewed in a practical sense as a
    Solaris™ process or lightweight process (LWP). See also strand and virtual
    processor.

TICK
    Hardware clock-tick counter register.

TL
    Trap Level.

TPC
    Trap-saved PC.

trap
    The action taken by the processor when it changes the instruction flow in
    response to the presence of an exception, a Tcc instruction, or an
    interrupt. The action is a vectored transfer of control to privileged or
    hyperprivileged software through a table. See also exception.

TSB
    Translation storage buffer. A table of the address translations that is
    maintained by software in system memory and that serves as a cache of the
    address translations.

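The five-bit strand identifier width given in the strand identifier (SID) entry above follows from the core and strand counts stated in the processor entry (eight physical cores, four strands each). The short Python sketch below reproduces that arithmetic. The function name strand_id_bits is an assumption made for illustration, and the sketch says nothing about how hardware assigns individual identifier values.

```python
def strand_id_bits(physical_cores: int, strands_per_core: int) -> int:
    """Return the number of bits needed to give every strand a unique ID."""
    total_strands = physical_cores * strands_per_core
    if physical_cores <= 0 or strands_per_core <= 0:
        raise ValueError("core and strand counts must be positive")
    # The largest ID is total_strands - 1; its bit length is the width n
    # such that 2**n >= total_strands.
    return (total_strands - 1).bit_length()


# UltraSPARC T1 figures as stated in this glossary: 8 cores x 4 strands.
t1_bits = strand_id_bits(8, 4)
```

With 8 x 4 = 32 strands, t1_bits evaluates to 5, matching the width stated in the glossary.
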
TSO
    Total store order.

TTE
    Translation table entry. Describes the virtual-to-physical translation and
    page attributes for a specific page in the Page Table. In some cases, the
    term is explicitly used for the entries in the TSB.

unassigned
    A value (for example, an ASI number), the semantics of which are not
    architecturally mandated and which may be determined independently by each
    implementation within any guidelines given.

undefined
    An aspect of the architecture that has deliberately been left unspecified.
    Software should have no expectation of, or make any assumptions about, an
    undefined feature or behavior. Use of such a feature can deliver
    unexpected results, may or may not cause a trap, can vary among
    implementations, and can vary with time on a given implementation.

    Notwithstanding any of the above, undefined aspects of the architecture
    shall not cause security holes (such as allowing user software to access
    privileged state), put the processor into supervisor mode, or put the
    processor into an unrecoverable state.

unimplemented
    An architectural feature that is not directly executed in hardware because
    it is optional or is emulated in software.

unpredictable
    Synonym: undefined.

unrestricted
    Describing an address space identifier (ASI) that can be used regardless
    of the processor mode; that is, regardless of the value of PSTATE.priv.

user application program
    Synonym: application program.

VA
    Virtual address.

virtual address
    An address produced by a processor that maps all systemwide,
    program-visible memory. Virtual addresses usually are translated by a
    combination of hardware and software to real addresses.

virtual processor
    The term virtual processor is used to identify each strand in a processor.
    Each virtual processor corresponds to a specific strand on a specific
    physical core where there may be multiple physical cores, each with
    multiple strands. In most respects, a virtual processor appears to the
    system, and to the operating system software, as a processing unit
    equivalent to a traditional single-stranded microprocessor (as in
    UltraSPARC I). Each virtual processor has its own interrupt ID, and the
    operating system can schedule independent threads on each virtual
    processor. How multiple virtual processors are achieved within a processor
    is an implementation issue, and as much as possible the software interface
    is independent of how multiple virtual processors are implemented. The
    term virtual processor is used in place of strand because of the common
    association of the term strand with multi-stranded physical cores. See
    also strand and thread.

VIS™
    Visual instruction set.

word
    An aligned quadlet. Note: The definition of this term is architecture
    dependent and may differ from that used in other processor architectures.

younger instruction
    See older instruction.

writeback
    The process of writing a dirty cache line back to memory before it is
    refilled.

WRPR
    Write Privileged Register.

Index





A 
Accumulated Exception (aexc) field of FSR 
register, 65 
Address Mask (am), 60 
Address Mask (am) 
field of PSTATE register, 40, 41, 57 
address space identifier (ASI) 
bypass, 124 
definition, 123 
nontranslating, 128 
application program, 123 
ASI_PRIMARY_NO_FAULT, 57
ASI_PRIMARY_NO_FAULT_LITTLE, 57
ASI_SECONDARY_NO_FAULT, 57
ASI_SECONDARY_NO_FAULT_LITTLE, 57
atomic quad load instructions (deprecated), 27


B 
BA instruction, 99 
BCC instruction, 99 
BCS instruction, 99 
BE instruction, 99 
BG instruction, 99 
BGE instruction, 99 
BGU instruction, 99 
Bicc instructions, 99 
BL instruction, 99 
BLE instruction, 99 
BLEU instruction, 99 
block 
copy, inner loop pseudo-code, 20 
load instructions, 21 


memory operations, 70 


block-transfer ASIs, 22 
BN instruction, 99 
BNE instruction, 99 
BNEG instruction, 99 
BPA instruction, 100 
BPCC instruction, 100 
BPCS instruction, 100 
BPE instruction, 100
BPG instruction, 100 
BPGE instruction, 100 
BPGU instruction, 100 
BPL instruction, 100 
BPLE instruction, 100 
BPLEU instruction, 100 
BPN instruction, 100 
BPNE instruction, 100 
BPNEG instruction, 100 
BPOS instruction, 99 
BPPOS instruction, 100 
BPr instructions, 100 
BPVC instruction, 100 
BPVS instruction, 100 
BVC instruction, 99 
BVS instruction, 99 
bypass ASI, 124 


C
caching 

TSB, 55 
canrestore Register, 61 
cansave Register, 61 
clean window, 62 


clean_window trap, 62
cleanwin Register, 61 
CLEANWIN register, 62 
compatibility with SPARC V9 
terminology and concepts, 123 
conventions 
font, ix 
notational, x 
cross call, 71 
Current Exception (cexc) field of FSR register, 65 
current window pointer (CWP) register 
definition, 125 
cwp Register, 61 


D 
D superscript on instruction name, 13 
data watchpoint 
virtual address, 57 
data_access_exception trap, 24, 27, 28, 41, 57, 59, 67
deferred 
trap, 60 
Diagnostic (diag) field of TTE, 54 
Dirty Lower (dl) field of FPRS register, 64 
Dirty Upper (du) field of FPRS register, 64 
D-MMU, 57 
doublet, 125 
doubleword 
definition, 125 


E 


enhanced security environment, 60 
exceptions 

fp_exception_other, 12
illegal_instruction, 12
extended 

instructions, 71 


F 
FABSd instruction, 98, 99 
FABSq instruction, 98, 99 
FBA instruction, 99 

FBE instruction, 99 
FBfcc instructions, 99 
FBG instruction, 99 
FBGE instruction, 99 


FBL instruction, 99 
FBLE instruction, 99 
FBLG instruction, 99 
FBN instruction, 99 
FBNE instruction, 99 
FBO instruction, 99 
FBPA instruction, 100 
FBPE instruction, 100 
FBPfcc instructions, 99 
FBPG instruction, 100 
FBPGE instruction, 100 
FBPL instruction, 100 
FBPLE instruction, 100 
FBPLG instruction, 100 
FBPN instruction, 100 
FBPNE instruction, 100 
FBPO instruction, 100 
FBPU instruction, 100 
FBPUE instruction, 100 
FBPUG instruction, 100 
FBPUGE instruction, 100 
FBPUL instruction, 100 
FBPULE instruction, 100 
FBU instruction, 99 
FBUE instruction, 99 
FBUG instruction, 99 
FBUGE instruction, 99 
FBUL instruction, 99 
FBULE instruction, 99 
FCMPd instruction, 99 
FCMPEd instruction, 99 
FCMPEq instruction, 99
FCMPEs instruction, 99
FCMPq instruction, 99
FCMPs instruction, 99 
FdTOx instruction, 98, 99 
floating point 
deferred trap queue (fq), 66 
exception handling, 63 
trap type (ftt) field of FSR register, 65 
Floating Point Condition Code (fcc) 
0 (fcc0) field of FSR register, 65, 66
1 (fcc1) field of FSR register, 65 
2 (fcc2) field of FSR register, 65 
3 (fcc3) field of FSR register, 65 
field of FSR register in SPARC-V8, 66 
Floating Point Registers State (FPRS) Register, 64 
floating-point trap type (ftt) field of FSR register, 12
floating-point trap types 
    unimplemented_FPop, 12
FLUSH instruction, 67
FMOVcc instructions, 100
FMOVccd instruction, 99
FMOVccq instruction, 99
FMOVccs instruction, 99
FMOVd instruction, 98, 99
FMOVq instruction, 98, 99
FNEGd instruction, 98, 99
FNEGq instruction, 98, 99
fp_exception_ieee_754 trap, 65, 66
fp_exception_other exception, 12
fp_exception_other trap, 59, 63, 65, 66
fq, see floating-point deferred trap queue (fq)
FqTOx instruction, 98, 99
FRF, 126
FsTOx instruction, 98, 99
FxTOd instruction, 98, 99
FxTOq instruction, 98, 99
FxTOs instruction, 98, 99


H
hardware
    interrupts, 71
hardware_error floating-point trap type, 66


I
IEEE Std 754-1985, 65, 125
IEEE support
    inexact exceptions, 113
    infinity arithmetic, 106
    NaN arithmetic, 112
    one infinity operand arithmetic, 107
    two infinity operand arithmetic, 110
    zero arithmetic, 111
IEEE_754_exception floating-point trap type, 66
IEEE_754_exception floating-point trap type, 125
illegal_instruction exception, 12
illegal_instruction trap, 40, 59, 66, 69, 71
ILLTRAP instructions, 59
implementation note, xii
initiated, 127
instruction fields
    definition, 127
instruction set architecture (ISA), 126, 127
instruction_access_exception trap, 40, 41, 57
instructions
    reserved, 12
integer
    division, 62
    multiplication, 62
    register file, 61
interrupt
    packet, 71
    request, 127
invalid_fp_register floating-point trap type, 66
Invert Endianness
    (ie) field of TTE, 54
IRF, 127


L
LDDF_mem_address_not_aligned trap, 69
LDQF instruction, 69
LDQFA instruction, 69
LDTW instruction, 69
little-endian
    byte ordering, 127
load instructions, 127
load twin extended word instructions, 25
load twin extended word instructions (deprecated), 27
load-store instructions
    definition, 127
Lock (l) field of TTE, 54


M
may (keyword), 128
mem_address_not_aligned trap, 24, 27, 28, 40, 57
MEMBAR
    #LoadStore, 19
    #StoreLoad, 19
    #StoreStore, 19, 67
    #Sync, 18, 19
memory
    model, 19
MOVcc instructions, 100
must (keyword), 128
M-way set-associative TSB, 55


N
N_REG_WINDOWS, 61
nested traps
    in SPARC-V9, 60
No-Fault Only (nfo) field of TTE, 54
nonfaulting load, 57
nonfaulting loads
    definition of, 128
nonprivileged
    mode, 123
Non-Standard (ns) field of FSR register, 65
nontranslating ASI, 128
note
    implementation, xii
    programming, xii
NPC register, 41


O
opcode
    definition, 129
otherwin Register, 61
out of range
    virtual address, 39
    virtual address, as target of JMPL or RETURN, 40


P
P superscript on instruction name, 13
partial store
    instruction, 70
physical address (pa)
    field of TTE, 54
population count (POPC) instruction, 60
power down mode, 71
precise traps, 60
PREFETCHA instruction, 68
privileged
    (priv) field of PSTATE register, 57
privileged_action trap, 57
programming note, xii
pstate, 19
PSTATE
    priv field, 128, 129


Q
quad-precision floating-point instructions, 63
quadword
    definition, 130


R
reserved
    fields in opcodes, 59
    instructions, 12, 59
Rounding Direction (rd) field of FSR register, 66


S
SAVE instruction, 62
secure environment, 60
self-modifying code, 67
sequence_error floating-point trap type, 66
shall (keyword), 131
short floating point
    load instruction, 0
    store instruction, 70
should (keyword), 131
software
    defined (soft) field of TTE, 54
    defined (soft2) field of TTE, 54
    Translation Table, 55
SPARC
    V9 compliance, 59
SPARC V9
    concepts and terminology, 123
speculative load, 57
STDF_mem_address_not_aligned trap, 69
store instructions, 132
STQF instruction, 69
STQFA instruction, 69
STTW instruction, 69


T
TA instruction, 99
Tcc instruction, reserved fields, 59
Tcc instructions, 99, 100
TCS instruction, 99
TE instruction, 99
terminology for SPARC V9, definition of, 123
TG instruction, 99
TGE instruction, 99
TGU instruction, 99
TL instruction, 99
TLE instruction, 99
TLEU instruction, 99
TN instruction, 99
TNE instruction, 99
TNEG instruction, 99
TPOS instruction, 99
Translation Table Entry see TTE 
trap 
stack, 60 
state registers, 60 
Trap Enable Mask (tem) field of FSR register, 65, 65, 
66 
TSB, 27 
caching, 55 
organization, 55 
Register, 55 
TTE, 53 
TVC instruction, 99 
TVS instruction, 99 


U 
UltraSPARC-I 
extended instructions, 71 
unfinished_FPop floating-point trap type, 66 
unimplemented 
instructions, 59 
unimplemented_FPop floating-point trap type, 63, 
66 
unimplemented_FPop floating-point trap type, 12 


V 
VA Data Watchpoint register, 57 
VA_watchpoint trap, 24, 28
Version (ver) field of FSR register, 65 
virtual address 
space illustrated, 40 


W 

watchpoint trap, 57 

window_fill trap, 40
Writable (w) field of TTE, 54 